* [PATCH 0/1] prctl: allow overriding system THP policy to always
@ 2025-05-07 14:00 Usama Arif
2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Usama Arif @ 2025-05-07 14:00 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team,
Usama Arif
Allowing override of global THP policy per process allows workloads
that have shown to benefit from hugepages to do so, without regressing
workloads that wouldn't benefit. This will allow such types of
workloads to be run/stacked on the same machine.
It also helps in rolling out hugepages in hyperscaler configurations
for workloads that benefit from them, where a single THP policy is
likely to be used across the entire fleet, and prctl will help override it.
An advantage of doing it via prctl vs creating a cgroup specific
option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
that this will work even when there are no cgroups present, and my
understanding is there is a strong preference of cgroups controls being
hierarchical which usually means them having a numerical value.
The output and code of test program is below:
[root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
[root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
[root@vm4 vmuser]# ./a.out
Default THP setting:
THP is not set to 'always'.
PR_SET_THP_ALWAYS = 1
THP is set to 'always'.
PR_SET_THP_ALWAYS = 0
THP is not set to 'always'.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#define PR_SET_THP_ALWAYS 78
#define SIZE 12 * (2 * 1024 * 1024) // 24 MB
void check_smaps(void) {
FILE *file = fopen("/proc/self/smaps", "r");
if (!file) {
perror("fopen");
return;
}
char line[256];
int is_hugepage = 0;
while (fgets(line, sizeof(line), file)) {
// if (strstr(line, "AnonHugePages:"))
// printf("%s\n", line);
if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB"))
{
// printf("%s\n", line);
is_hugepage = 1;
break;
}
}
fclose(file);
if (is_hugepage) {
printf("THP is set to 'always'.\n");
} else {
printf("THP is not set to 'always'.\n");
}
}
void test_mmap_thp(void) {
char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (buffer == MAP_FAILED) {
perror("mmap");
return;
}
// Touch the memory to ensure it's allocated
memset(buffer, 0, SIZE);
check_smaps();
munmap(buffer, SIZE);
}
int main() {
printf("Default THP setting: \n");
test_mmap_thp();
printf("PR_SET_THP_ALWAYS = 1 \n");
prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL);
test_mmap_thp();
printf("PR_SET_THP_ALWAYS = 0 \n");
prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL);
test_mmap_thp();
return 0;
}
Usama Arif (1):
prctl: allow overriding system THP policy to always per process
include/linux/huge_mm.h | 3 ++-
include/linux/mm_types.h | 7 ++-----
include/uapi/linux/prctl.h | 3 +++
kernel/sys.c | 16 ++++++++++++++++
tools/include/uapi/linux/prctl.h | 3 +++
.../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++
6 files changed, 29 insertions(+), 6 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 29+ messages in thread* [PATCH 1/1] prctl: allow overriding system THP policy to always per process 2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif @ 2025-05-07 14:00 ` Usama Arif 2025-05-07 15:02 ` Usama Arif 2025-05-07 20:14 ` Zi Yan 2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan 2025-05-08 11:06 ` David Hildenbrand 2 siblings, 2 replies; 29+ messages in thread From: Usama Arif @ 2025-05-07 14:00 UTC (permalink / raw) To: Andrew Morton, david, linux-mm Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team, Usama Arif Allowing override of global THP policy per process allows workloads that have shown to benefit from hugepages to do so, without regressing workloads that wouldn't benefit. This will allow such types of workloads to be run/stacked on the same machine. It also helps in rolling out hugepages in hyperscaler configurations for workloads that benefit from them, where a single THP policy is likely to be used across the entire fleet, and prctl will help override it. Signed-off-by: Usama Arif <usamaarif642@gmail.com> --- include/linux/huge_mm.h | 3 ++- include/linux/mm_types.h | 7 ++----- include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 16 ++++++++++++++++ tools/include/uapi/linux/prctl.h | 3 +++ .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ 6 files changed, 29 insertions(+), 6 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2f190c90192d..0587dc4b8e2d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, if (vm_flags & VM_HUGEPAGE) mask |= READ_ONCE(huge_anon_orders_madvise); if (hugepage_global_always() || - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled())) + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) || + test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags)) mask |= READ_ONCE(huge_anon_orders_inherit); orders &= mask; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e76bade9ebb1..9bcd72b2b191 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1704,11 +1704,8 @@ enum { #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */ -/* - * This one-shot flag is dropped due to necessity of changing exe once again - * on NFS restore - */ -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ +/* override inherited page sizes to always for the entire process */ + #define MMF_THP_ALWAYS 18 #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 15c18ef4eb11..22c526681562 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -364,4 +364,7 @@ struct prctl_mm_map { # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 +#define PR_SET_THP_ALWAYS 78 +#define PR_GET_THP_ALWAYS 79 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index c434968e9f5d..ee56b059ff1f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, clear_bit(MMF_DISABLE_THP, &me->mm->flags); mmap_write_unlock(me->mm); break; + case PR_GET_THP_ALWAYS: + if (arg2 || arg3 || arg4 || arg5) + return -EINVAL; + error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags); + break; + case PR_SET_THP_ALWAYS: + if (arg3 || arg4 || arg5) + return -EINVAL; + if (mmap_write_lock_killable(me->mm)) + return -EINTR; + if (arg2) + set_bit(MMF_THP_ALWAYS, &me->mm->flags); + else + clear_bit(MMF_THP_ALWAYS, &me->mm->flags); + mmap_write_unlock(me->mm); + break; case PR_MPX_ENABLE_MANAGEMENT: case PR_MPX_DISABLE_MANAGEMENT: /* No longer implemented: */ diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h index 35791791a879..f5f6cff42b3f 100644 --- a/tools/include/uapi/linux/prctl.h +++ b/tools/include/uapi/linux/prctl.h @@ -328,4 +328,7 @@ struct prctl_mm_map { # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ # define PR_PPC_DEXCR_CTRL_MASK 0x1f +#define PR_GET_THP_ALWAYS 78 +#define PR_SET_THP_ALWAYS 79 + #endif /* _LINUX_PRCTL_H */ diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h index 15c18ef4eb11..680996d56faf 100644 --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h @@ -364,4 +364,7 @@ struct prctl_mm_map { # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 +#define PR_GET_THP_ALWAYS 78 +#define PR_SET_THP_ALWAYS 79 + #endif /* _LINUX_PRCTL_H */ -- 2.47.1 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process 2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif @ 2025-05-07 15:02 ` Usama Arif 2025-05-07 20:14 ` Zi Yan 1 sibling, 0 replies; 29+ messages in thread From: Usama Arif @ 2025-05-07 15:02 UTC (permalink / raw) To: Andrew Morton, david, linux-mm Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 07/05/2025 15:00, Usama Arif wrote: > Allowing override of global THP policy per process allows workloads > that have shown to benefit from hugepages to do so, without regressing > workloads that wouldn't benefit. This will allow such types of workloads > to be run/stacked on the same machine. > > It also helps in rolling out hugepages in hyperscaler configurations > for workloads that benefit from them, where a single THP policy is likely > to be used across the entire fleet, and prctl will help override it. > > Signed-off-by: Usama Arif <usamaarif642@gmail.com> > --- > include/linux/huge_mm.h | 3 ++- > include/linux/mm_types.h | 7 ++----- > include/uapi/linux/prctl.h | 3 +++ > kernel/sys.c | 16 ++++++++++++++++ > tools/include/uapi/linux/prctl.h | 3 +++ > .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ > 6 files changed, 29 insertions(+), 6 deletions(-) > I forgot to include the change for non-anon VMA, which Johannes pointed out (Thanks!) The patch will require the below fixlet on top of it: From b719cd28ae78de699c0f801a4283449f6ac767ad Mon Sep 17 00:00:00 2001 From: Usama Arif <usamaarif642@gmail.com> Date: Wed, 7 May 2025 15:58:39 +0100 Subject: [PATCH] prctl: Allowing override of global THP policy per process This is a fixlet for doing it for non-anon VMAs as well. Signed-off-by: Usama Arif <usamaarif642@gmail.com> --- mm/huge_memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2780a12b25f0..c4bf8eae420c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -177,7 +177,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, */ if (enforce_sysfs && (!hugepage_global_enabled() || (!(vm_flags & VM_HUGEPAGE) && - !hugepage_global_always()))) + !hugepage_global_always()) || + test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))) return 0; /* -- 2.47.1 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process 2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif 2025-05-07 15:02 ` Usama Arif @ 2025-05-07 20:14 ` Zi Yan 2025-05-08 10:53 ` Usama Arif 1 sibling, 1 reply; 29+ messages in thread From: Zi Yan @ 2025-05-07 20:14 UTC (permalink / raw) To: Usama Arif Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 7 May 2025, at 10:00, Usama Arif wrote: > Allowing override of global THP policy per process allows workloads > that have shown to benefit from hugepages to do so, without regressing > workloads that wouldn't benefit. This will allow such types of workloads > to be run/stacked on the same machine. > > It also helps in rolling out hugepages in hyperscaler configurations > for workloads that benefit from them, where a single THP policy is likely > to be used across the entire fleet, and prctl will help override it. > > Signed-off-by: Usama Arif <usamaarif642@gmail.com> > --- > include/linux/huge_mm.h | 3 ++- > include/linux/mm_types.h | 7 ++----- > include/uapi/linux/prctl.h | 3 +++ > kernel/sys.c | 16 ++++++++++++++++ > tools/include/uapi/linux/prctl.h | 3 +++ > .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ > 6 files changed, 29 insertions(+), 6 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 2f190c90192d..0587dc4b8e2d 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, > if (vm_flags & VM_HUGEPAGE) > mask |= READ_ONCE(huge_anon_orders_madvise); > if (hugepage_global_always() || > - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled())) > + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) || > + test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags)) > mask |= READ_ONCE(huge_anon_orders_inherit); > > orders &= mask; > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index e76bade9ebb1..9bcd72b2b191 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1704,11 +1704,8 @@ enum { > #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ > #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */ > > -/* > - * This one-shot flag is dropped due to necessity of changing exe once again > - * on NFS restore > - */ > -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ > +/* override inherited page sizes to always for the entire process */ > + #define MMF_THP_ALWAYS 18 Could we have something like MMF_THP_POLICY_SET and another field for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if we want it in the future? > > #define MMF_HAS_UPROBES 19 /* has uprobes */ > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 15c18ef4eb11..22c526681562 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -364,4 +364,7 @@ struct prctl_mm_map { > # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 > # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 > > +#define PR_SET_THP_ALWAYS 78 > +#define PR_GET_THP_ALWAYS 79 > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/sys.c b/kernel/sys.c > index c434968e9f5d..ee56b059ff1f 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > clear_bit(MMF_DISABLE_THP, &me->mm->flags); > mmap_write_unlock(me->mm); > break; > + case PR_GET_THP_ALWAYS: > + if (arg2 || arg3 || arg4 || arg5) > + return -EINVAL; > + error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags); > + break; > + case PR_SET_THP_ALWAYS: > + if (arg3 || arg4 || arg5) > + return -EINVAL; > + if (mmap_write_lock_killable(me->mm)) > + return -EINTR; > + if (arg2) > + set_bit(MMF_THP_ALWAYS, &me->mm->flags); > + else > + clear_bit(MMF_THP_ALWAYS, &me->mm->flags); > + mmap_write_unlock(me->mm); > + break; prctl can take more than one arguments. Would it be better to add PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS in the second argument? So that in the future, if we want to add more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*? > case PR_MPX_ENABLE_MANAGEMENT: > case PR_MPX_DISABLE_MANAGEMENT: > /* No longer implemented: */ > diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h > index 35791791a879..f5f6cff42b3f 100644 > --- a/tools/include/uapi/linux/prctl.h > +++ b/tools/include/uapi/linux/prctl.h > @@ -328,4 +328,7 @@ struct prctl_mm_map { > # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ > # define PR_PPC_DEXCR_CTRL_MASK 0x1f > > +#define PR_GET_THP_ALWAYS 78 > +#define PR_SET_THP_ALWAYS 79 > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h > index 15c18ef4eb11..680996d56faf 100644 > --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h > +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h > @@ -364,4 +364,7 @@ struct prctl_mm_map { > # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 > # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 > > +#define PR_GET_THP_ALWAYS 78 > +#define PR_SET_THP_ALWAYS 79 > + > #endif /* _LINUX_PRCTL_H */ > -- > 2.47.1 -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process 2025-05-07 20:14 ` Zi Yan @ 2025-05-08 10:53 ` Usama Arif 2025-05-08 20:29 ` Zi Yan 0 siblings, 1 reply; 29+ messages in thread From: Usama Arif @ 2025-05-08 10:53 UTC (permalink / raw) To: Zi Yan Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 07/05/2025 21:14, Zi Yan wrote: > On 7 May 2025, at 10:00, Usama Arif wrote: > >> Allowing override of global THP policy per process allows workloads >> that have shown to benefit from hugepages to do so, without regressing >> workloads that wouldn't benefit. This will allow such types of workloads >> to be run/stacked on the same machine. >> >> It also helps in rolling out hugepages in hyperscaler configurations >> for workloads that benefit from them, where a single THP policy is likely >> to be used across the entire fleet, and prctl will help override it. >> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com> >> --- >> include/linux/huge_mm.h | 3 ++- >> include/linux/mm_types.h | 7 ++----- >> include/uapi/linux/prctl.h | 3 +++ >> kernel/sys.c | 16 ++++++++++++++++ >> tools/include/uapi/linux/prctl.h | 3 +++ >> .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ >> 6 files changed, 29 insertions(+), 6 deletions(-) >> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >> index 2f190c90192d..0587dc4b8e2d 100644 >> --- a/include/linux/huge_mm.h >> +++ b/include/linux/huge_mm.h >> @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, >> if (vm_flags & VM_HUGEPAGE) >> mask |= READ_ONCE(huge_anon_orders_madvise); >> if (hugepage_global_always() || >> - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled())) >> + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) || >> + test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags)) >> mask |= READ_ONCE(huge_anon_orders_inherit); >> >> orders &= mask; >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >> index e76bade9ebb1..9bcd72b2b191 100644 >> --- a/include/linux/mm_types.h >> +++ b/include/linux/mm_types.h >> @@ -1704,11 +1704,8 @@ enum { >> #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ >> #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */ >> >> -/* >> - * This one-shot flag is dropped due to necessity of changing exe once again >> - * on NFS restore >> - */ >> -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ >> +/* override inherited page sizes to always for the entire process */ >> + #define MMF_THP_ALWAYS 18 > > Could we have something like MMF_THP_POLICY_SET and another field > for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if > we want it in the future? > So we actually need MMF_THP_MADVISE as well. If we have the entire fleet system policy set to always, but if certain workloads regress with it, then we need to prctl MADVISE them. As you said, either we have: - MMF_THP_MADVISE as another flag. The issue with this is we have run out of bits in mm->flags for 32 bit machines. So either we introduce another member in mm_struct (maybe mm_struct->flags2?), or we start using bits 32 and above of flags field, limit this to 64 bit architectures and wrap the code with #ifdefs. This is probably ugly, but if its ok for upstream, happy to do that. - MMF_THP_POLICY_SET + another field to specify what the policy is, the issue with this might be another field per mm_struct/process. Having an entire field just for this might be wasteful, so I think I would prefer having just mm_struct->flags2 and use up only 1 bit of it, and the rest can be used for anything else in the future. I think the flags2 approach is likely the best, but let me know what you think. >> >> #define MMF_HAS_UPROBES 19 /* has uprobes */ >> #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> index 15c18ef4eb11..22c526681562 100644 >> --- a/include/uapi/linux/prctl.h >> +++ b/include/uapi/linux/prctl.h >> @@ -364,4 +364,7 @@ struct prctl_mm_map { >> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 >> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 >> >> +#define PR_SET_THP_ALWAYS 78 >> +#define PR_GET_THP_ALWAYS 79 >> + >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/kernel/sys.c b/kernel/sys.c >> index c434968e9f5d..ee56b059ff1f 100644 >> --- a/kernel/sys.c >> +++ b/kernel/sys.c >> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, >> clear_bit(MMF_DISABLE_THP, &me->mm->flags); >> mmap_write_unlock(me->mm); >> break; >> + case PR_GET_THP_ALWAYS: >> + if (arg2 || arg3 || arg4 || arg5) >> + return -EINVAL; >> + error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags); >> + break; >> + case PR_SET_THP_ALWAYS: >> + if (arg3 || arg4 || arg5) >> + return -EINVAL; >> + if (mmap_write_lock_killable(me->mm)) >> + return -EINTR; >> + if (arg2) >> + set_bit(MMF_THP_ALWAYS, &me->mm->flags); >> + else >> + clear_bit(MMF_THP_ALWAYS, &me->mm->flags); >> + mmap_write_unlock(me->mm); >> + break; > > prctl can take more than one arguments. Would it be better to add > PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS > in the second argument? So that in the future, if we want to add > more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*? I am ok with either, with prctl values there isnt any limit like MMF. I went with how the other prctls are used, i.e. PR_SET_THP_DISABLE uses arg2 to check if it needs to be enabled (non 0)/disabled (0 value). I think having PR_GET/SET_THP_ALWAYS/MADVISE could be better as it would work similar to how existing PR_* (PR_SET_THP_DISABLE) works with arg2 used for enabling/disabling and we dont have any limits for the number of PR_* that can exist. But also happy to change it to PR_GET/SET_THP_POLICY. > >> case PR_MPX_ENABLE_MANAGEMENT: >> case PR_MPX_DISABLE_MANAGEMENT: >> /* No longer implemented: */ >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h >> index 35791791a879..f5f6cff42b3f 100644 >> --- a/tools/include/uapi/linux/prctl.h >> +++ b/tools/include/uapi/linux/prctl.h >> @@ -328,4 +328,7 @@ struct prctl_mm_map { >> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ >> # define PR_PPC_DEXCR_CTRL_MASK 0x1f >> >> +#define PR_GET_THP_ALWAYS 78 >> +#define PR_SET_THP_ALWAYS 79 >> + >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h >> index 15c18ef4eb11..680996d56faf 100644 >> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h >> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h >> @@ -364,4 +364,7 @@ struct prctl_mm_map { >> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 >> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 >> >> +#define PR_GET_THP_ALWAYS 78 >> +#define PR_SET_THP_ALWAYS 79 >> + >> #endif /* _LINUX_PRCTL_H */ >> -- >> 2.47.1 > > > -- > Best Regards, > Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process 2025-05-08 10:53 ` Usama Arif @ 2025-05-08 20:29 ` Zi Yan 0 siblings, 0 replies; 29+ messages in thread From: Zi Yan @ 2025-05-08 20:29 UTC (permalink / raw) To: Usama Arif Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 8 May 2025, at 6:53, Usama Arif wrote: > On 07/05/2025 21:14, Zi Yan wrote: >> On 7 May 2025, at 10:00, Usama Arif wrote: >> >>> Allowing override of global THP policy per process allows workloads >>> that have shown to benefit from hugepages to do so, without regressing >>> workloads that wouldn't benefit. This will allow such types of workloads >>> to be run/stacked on the same machine. >>> >>> It also helps in rolling out hugepages in hyperscaler configurations >>> for workloads that benefit from them, where a single THP policy is likely >>> to be used across the entire fleet, and prctl will help override it. >>> >>> Signed-off-by: Usama Arif <usamaarif642@gmail.com> >>> --- >>> include/linux/huge_mm.h | 3 ++- >>> include/linux/mm_types.h | 7 ++----- >>> include/uapi/linux/prctl.h | 3 +++ >>> kernel/sys.c | 16 ++++++++++++++++ >>> tools/include/uapi/linux/prctl.h | 3 +++ >>> .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ >>> 6 files changed, 29 insertions(+), 6 deletions(-) >>> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >>> index 2f190c90192d..0587dc4b8e2d 100644 >>> --- a/include/linux/huge_mm.h >>> +++ b/include/linux/huge_mm.h >>> @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, >>> if (vm_flags & VM_HUGEPAGE) >>> mask |= READ_ONCE(huge_anon_orders_madvise); >>> if (hugepage_global_always() || >>> - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled())) >>> + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) || >>> + test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags)) >>> mask |= READ_ONCE(huge_anon_orders_inherit); >>> >>> orders &= mask; >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>> index e76bade9ebb1..9bcd72b2b191 100644 >>> --- a/include/linux/mm_types.h >>> +++ b/include/linux/mm_types.h >>> @@ -1704,11 +1704,8 @@ enum { >>> #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ >>> #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */ >>> >>> -/* >>> - * This one-shot flag is dropped due to necessity of changing exe once again >>> - * on NFS restore >>> - */ >>> -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ >>> +/* override inherited page sizes to always for the entire process */ >>> + #define MMF_THP_ALWAYS 18 >> >> Could we have something like MMF_THP_POLICY_SET and another field >> for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if >> we want it in the future? >> > > So we actually need MMF_THP_MADVISE as well. If we have the entire fleet system > policy set to always, but if certain workloads regress with it, then we need to > prctl MADVISE them. > > As you said, either we have: > - MMF_THP_MADVISE as another flag. The issue with this is we have run out > of bits in mm->flags for 32 bit machines. So either we introduce another member > in mm_struct (maybe mm_struct->flags2?), or we start using bits 32 and above of > flags field, limit this to 64 bit architectures and wrap the code with #ifdefs. > This is probably ugly, but if its ok for upstream, happy to do that. > - MMF_THP_POLICY_SET + another field to specify what the policy is, the issue > with this might be another field per mm_struct/process. Having an entire field > just for this might be wasteful, so I think I would prefer having just > mm_struct->flags2 and use up only 1 bit of it, and the rest can be used for > anything else in the future. > > I think the flags2 approach is likely the best, but let me know what you think. Sounds good to me. If you are adding flags2, maybe put both MMF_THP_ALWAYS and MMF_THP_MADVISE in flags2 to make code look more consistent? Just a thought. >>> >>> #define MMF_HAS_UPROBES 19 /* has uprobes */ >>> #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ >>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >>> index 15c18ef4eb11..22c526681562 100644 >>> --- a/include/uapi/linux/prctl.h >>> +++ b/include/uapi/linux/prctl.h >>> @@ -364,4 +364,7 @@ struct prctl_mm_map { >>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 >>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 >>> >>> +#define PR_SET_THP_ALWAYS 78 >>> +#define PR_GET_THP_ALWAYS 79 >>> + >>> #endif /* _LINUX_PRCTL_H */ >>> diff --git a/kernel/sys.c b/kernel/sys.c >>> index c434968e9f5d..ee56b059ff1f 100644 >>> --- a/kernel/sys.c >>> +++ b/kernel/sys.c >>> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, >>> clear_bit(MMF_DISABLE_THP, &me->mm->flags); >>> mmap_write_unlock(me->mm); >>> break; >>> + case PR_GET_THP_ALWAYS: >>> + if (arg2 || arg3 || arg4 || arg5) >>> + return -EINVAL; >>> + error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags); >>> + break; >>> + case PR_SET_THP_ALWAYS: >>> + if (arg3 || arg4 || arg5) >>> + return -EINVAL; >>> + if (mmap_write_lock_killable(me->mm)) >>> + return -EINTR; >>> + if (arg2) >>> + set_bit(MMF_THP_ALWAYS, &me->mm->flags); >>> + else >>> + clear_bit(MMF_THP_ALWAYS, &me->mm->flags); >>> + mmap_write_unlock(me->mm); >>> + break; >> >> prctl can take more than one arguments. Would it be better to add >> PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS >> in the second argument? So that in the future, if we want to add >> more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*? > > I am ok with either, with prctl values there isnt any limit like MMF. > I went with how the other prctls are used, i.e. PR_SET_THP_DISABLE uses > arg2 to check if it needs to be enabled (non 0)/disabled (0 value). > > I think having PR_GET/SET_THP_ALWAYS/MADVISE could be better as it would > work similar to how existing PR_* (PR_SET_THP_DISABLE) works with arg2 > used for enabling/disabling and we dont have any limits for the number > of PR_* that can exist. But also happy to change it to PR_GET/SET_THP_POLICY. OK, if the number of PR_* is not a concern, I do not have a strong opinion on this one. Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif 2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif @ 2025-05-07 14:57 ` Zi Yan 2025-05-07 15:12 ` Usama Arif 2025-05-08 11:06 ` David Hildenbrand 2 siblings, 1 reply; 29+ messages in thread From: Zi Yan @ 2025-05-07 14:57 UTC (permalink / raw) To: Usama Arif Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team, Yafang Shao +Yafang, who is also looking at changing THP config at cgroup/container level. On 7 May 2025, at 10:00, Usama Arif wrote: > Allowing override of global THP policy per process allows workloads > that have shown to benefit from hugepages to do so, without regressing > workloads that wouldn't benefit. This will allow such types of > workloads to be run/stacked on the same machine. > > It also helps in rolling out hugepages in hyperscaler configurations > for workloads that benefit from them, where a single THP policy is > likely to be used across the entire fleet, and prctl will help override it. > > An advantage of doing it via prctl vs creating a cgroup specific > option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > that this will work even when there are no cgroups present, and my > understanding is there is a strong preference of cgroups controls being > hierarchical which usually means them having a numerical value. Hi Usama, Do you mind giving an example on how to change THP policy for a set of processes running in a container (under a cgroup)? Yafang mentioned that the prctl approach would require restarting all running services[1] and other inflexiblities, so he proposed to use BPF to change THP policy[2]. I wonder if Yafang's issues also apply to your case and if you have a solution to them. Thanks. [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/ [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/ > > > The output and code of test program is below: > > [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled > [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > [root@vm4 vmuser]# ./a.out > Default THP setting: > THP is not set to 'always'. > PR_SET_THP_ALWAYS = 1 > THP is set to 'always'. > PR_SET_THP_ALWAYS = 0 > THP is not set to 'always'. > > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/mman.h> > #include <sys/prctl.h> > > #define PR_SET_THP_ALWAYS 78 > #define SIZE 12 * (2 * 1024 * 1024) // 24 MB > > void check_smaps(void) { > FILE *file = fopen("/proc/self/smaps", "r"); > if (!file) { > perror("fopen"); > return; > } > > char line[256]; > int is_hugepage = 0; > while (fgets(line, sizeof(line), file)) { > // if (strstr(line, "AnonHugePages:")) > // printf("%s\n", line); > if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) > { > // printf("%s\n", line); > is_hugepage = 1; > break; > } > } > fclose(file); > if (is_hugepage) { > printf("THP is set to 'always'.\n"); > } else { > printf("THP is not set to 'always'.\n"); > } > } > > void test_mmap_thp(void) { > char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > if (buffer == MAP_FAILED) { > perror("mmap"); > return; > } > // Touch the memory to ensure it's allocated > memset(buffer, 0, SIZE); > check_smaps(); > munmap(buffer, SIZE); > } > > int main() { > printf("Default THP setting: \n"); > test_mmap_thp(); > printf("PR_SET_THP_ALWAYS = 1 \n"); > prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL); > test_mmap_thp(); > printf("PR_SET_THP_ALWAYS = 0 \n"); > prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL); > test_mmap_thp(); > > return 0; > } > > > Usama Arif (1): > prctl: allow overriding system THP policy to always per process > > include/linux/huge_mm.h | 3 ++- > include/linux/mm_types.h | 7 ++----- > include/uapi/linux/prctl.h | 3 +++ > kernel/sys.c | 16 ++++++++++++++++ > tools/include/uapi/linux/prctl.h | 3 +++ > .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ > 6 files changed, 29 insertions(+), 6 deletions(-) > > -- > 2.47.1 -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan @ 2025-05-07 15:12 ` Usama Arif 2025-05-07 15:57 ` Zi Yan 0 siblings, 1 reply; 29+ messages in thread From: Usama Arif @ 2025-05-07 15:12 UTC (permalink / raw) To: Zi Yan Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team, Yafang Shao On 07/05/2025 15:57, Zi Yan wrote: > +Yafang, who is also looking at changing THP config at cgroup/container level. > > On 7 May 2025, at 10:00, Usama Arif wrote: > >> Allowing override of global THP policy per process allows workloads >> that have shown to benefit from hugepages to do so, without regressing >> workloads that wouldn't benefit. This will allow such types of >> workloads to be run/stacked on the same machine. >> >> It also helps in rolling out hugepages in hyperscaler configurations >> for workloads that benefit from them, where a single THP policy is >> likely to be used across the entire fleet, and prctl will help override it. >> >> An advantage of doing it via prctl vs creating a cgroup specific >> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is >> that this will work even when there are no cgroups present, and my >> understanding is there is a strong preference of cgroups controls being >> hierarchical which usually means them having a numerical value. > > Hi Usama, > > Do you mind giving an example on how to change THP policy for a set of > processes running in a container (under a cgroup)? Hi Zi, In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always for processes in a cgroup is in the same way we enable KSM for the cgroup. The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS in exec-invoke. This is at the start of the process, but you would already know at the start of the process whether you want THP=always for it or not. [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 Thanks, Usama > > Yafang mentioned that the prctl approach would require restarting all running > services[1] and other inflexiblities, so he proposed to use BPF to change THP > policy[2]. I wonder if Yafang's issues also apply to your case and if you > have a solution to them. > > Thanks. > > [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/ > [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/ >> >> >> The output and code of test program is below: >> >> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled >> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >> [root@vm4 vmuser]# ./a.out >> Default THP setting: >> THP is not set to 'always'. >> PR_SET_THP_ALWAYS = 1 >> THP is set to 'always'. >> PR_SET_THP_ALWAYS = 0 >> THP is not set to 'always'. >> >> >> #include <stdio.h> >> #include <stdlib.h> >> #include <string.h> >> #include <unistd.h> >> #include <sys/mman.h> >> #include <sys/prctl.h> >> >> #define PR_SET_THP_ALWAYS 78 >> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB >> >> void check_smaps(void) { >> FILE *file = fopen("/proc/self/smaps", "r"); >> if (!file) { >> perror("fopen"); >> return; >> } >> >> char line[256]; >> int is_hugepage = 0; >> while (fgets(line, sizeof(line), file)) { >> // if (strstr(line, "AnonHugePages:")) >> // printf("%s\n", line); >> if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) >> { >> // printf("%s\n", line); >> is_hugepage = 1; >> break; >> } >> } >> fclose(file); >> if (is_hugepage) { >> printf("THP is set to 'always'.\n"); >> } else { >> printf("THP is not set to 'always'.\n"); >> } >> } >> >> void test_mmap_thp(void) { >> char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >> if (buffer == MAP_FAILED) { >> perror("mmap"); >> return; >> } >> // Touch the memory to ensure it's allocated >> memset(buffer, 0, SIZE); >> check_smaps(); >> munmap(buffer, SIZE); >> } >> >> int main() { >> printf("Default THP setting: \n"); >> test_mmap_thp(); >> printf("PR_SET_THP_ALWAYS = 1 \n"); >> prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL); >> test_mmap_thp(); >> printf("PR_SET_THP_ALWAYS = 0 \n"); >> prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL); >> test_mmap_thp(); >> >> return 0; >> } >> >> >> Usama Arif (1): >> prctl: allow overriding system THP policy to always per process >> >> include/linux/huge_mm.h | 3 ++- >> include/linux/mm_types.h | 7 ++----- >> include/uapi/linux/prctl.h | 3 +++ >> kernel/sys.c | 16 ++++++++++++++++ >> tools/include/uapi/linux/prctl.h | 3 +++ >> .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ >> 6 files changed, 29 insertions(+), 6 deletions(-) >> >> -- >> 2.47.1 > > > -- > Best Regards, > Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-07 15:12 ` Usama Arif @ 2025-05-07 15:57 ` Zi Yan 2025-05-07 16:09 ` Usama Arif 0 siblings, 1 reply; 29+ messages in thread From: Zi Yan @ 2025-05-07 15:57 UTC (permalink / raw) To: Usama Arif Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team, Yafang Shao On 7 May 2025, at 11:12, Usama Arif wrote: > On 07/05/2025 15:57, Zi Yan wrote: >> +Yafang, who is also looking at changing THP config at cgroup/container level. >> >> On 7 May 2025, at 10:00, Usama Arif wrote: >> >>> Allowing override of global THP policy per process allows workloads >>> that have shown to benefit from hugepages to do so, without regressing >>> workloads that wouldn't benefit. This will allow such types of >>> workloads to be run/stacked on the same machine. >>> >>> It also helps in rolling out hugepages in hyperscaler configurations >>> for workloads that benefit from them, where a single THP policy is >>> likely to be used across the entire fleet, and prctl will help override it. >>> >>> An advantage of doing it via prctl vs creating a cgroup specific >>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is >>> that this will work even when there are no cgroups present, and my >>> understanding is there is a strong preference of cgroups controls being >>> hierarchical which usually means them having a numerical value. >> >> Hi Usama, >> >> Do you mind giving an example on how to change THP policy for a set of >> processes running in a container (under a cgroup)? > > Hi Zi, > > In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > for processes in a cgroup is in the same way we enable KSM for the cgroup. > The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > in exec-invoke. > This is at the start of the process, but you would already know at the start of the process > whether you want THP=always for it or not. > > [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 You also need to add a new systemd.directives, e.g., MemoryTHP, to pass the THP enablement or disablement info from a systemd config file. And if you find those processes do not benefit from using THPs, you can just change the new "MemoryTHP" config and restart the processes. Am I getting it? Thanks. >> >> Yafang mentioned that the prctl approach would require restarting all running >> services[1] and other inflexiblities, so he proposed to use BPF to change THP >> policy[2]. I wonder if Yafang's issues also apply to your case and if you >> have a solution to them. >> >> Thanks. >> >> [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/ >> [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/ >>> >>> >>> The output and code of test program is below: >>> >>> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled >>> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >>> [root@vm4 vmuser]# ./a.out >>> Default THP setting: >>> THP is not set to 'always'. >>> PR_SET_THP_ALWAYS = 1 >>> THP is set to 'always'. >>> PR_SET_THP_ALWAYS = 0 >>> THP is not set to 'always'. >>> >>> >>> #include <stdio.h> >>> #include <stdlib.h> >>> #include <string.h> >>> #include <unistd.h> >>> #include <sys/mman.h> >>> #include <sys/prctl.h> >>> >>> #define PR_SET_THP_ALWAYS 78 >>> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB >>> >>> void check_smaps(void) { >>> FILE *file = fopen("/proc/self/smaps", "r"); >>> if (!file) { >>> perror("fopen"); >>> return; >>> } >>> >>> char line[256]; >>> int is_hugepage = 0; >>> while (fgets(line, sizeof(line), file)) { >>> // if (strstr(line, "AnonHugePages:")) >>> // printf("%s\n", line); >>> if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) >>> { >>> // printf("%s\n", line); >>> is_hugepage = 1; >>> break; >>> } >>> } >>> fclose(file); >>> if (is_hugepage) { >>> printf("THP is set to 'always'.\n"); >>> } else { >>> printf("THP is not set to 'always'.\n"); >>> } >>> } >>> >>> void test_mmap_thp(void) { >>> char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >>> if (buffer == MAP_FAILED) { >>> perror("mmap"); >>> return; >>> } >>> // Touch the memory to ensure it's allocated >>> memset(buffer, 0, SIZE); >>> check_smaps(); >>> munmap(buffer, SIZE); >>> } >>> >>> int main() { >>> printf("Default THP setting: \n"); >>> test_mmap_thp(); >>> printf("PR_SET_THP_ALWAYS = 1 \n"); >>> prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL); >>> test_mmap_thp(); >>> printf("PR_SET_THP_ALWAYS = 0 \n"); >>> prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL); >>> test_mmap_thp(); >>> >>> return 0; >>> } >>> >>> >>> Usama Arif (1): >>> prctl: allow overriding system THP policy to always per process >>> >>> include/linux/huge_mm.h | 3 ++- >>> include/linux/mm_types.h | 7 ++----- >>> include/uapi/linux/prctl.h | 3 +++ >>> kernel/sys.c | 16 ++++++++++++++++ >>> tools/include/uapi/linux/prctl.h | 3 +++ >>> .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ >>> 6 files changed, 29 insertions(+), 6 deletions(-) >>> >>> -- >>> 2.47.1 >> >> >> -- >> Best Regards, >> Yan, Zi -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-07 15:57 ` Zi Yan @ 2025-05-07 16:09 ` Usama Arif 2025-05-08 5:41 ` Yafang Shao 0 siblings, 1 reply; 29+ messages in thread From: Usama Arif @ 2025-05-07 16:09 UTC (permalink / raw) To: Zi Yan Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team, Yafang Shao On 07/05/2025 16:57, Zi Yan wrote: > On 7 May 2025, at 11:12, Usama Arif wrote: > >> On 07/05/2025 15:57, Zi Yan wrote: >>> +Yafang, who is also looking at changing THP config at cgroup/container level. >>> >>> On 7 May 2025, at 10:00, Usama Arif wrote: >>> >>>> Allowing override of global THP policy per process allows workloads >>>> that have shown to benefit from hugepages to do so, without regressing >>>> workloads that wouldn't benefit. This will allow such types of >>>> workloads to be run/stacked on the same machine. >>>> >>>> It also helps in rolling out hugepages in hyperscaler configurations >>>> for workloads that benefit from them, where a single THP policy is >>>> likely to be used across the entire fleet, and prctl will help override it. >>>> >>>> An advantage of doing it via prctl vs creating a cgroup specific >>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is >>>> that this will work even when there are no cgroups present, and my >>>> understanding is there is a strong preference of cgroups controls being >>>> hierarchical which usually means them having a numerical value. >>> >>> Hi Usama, >>> >>> Do you mind giving an example on how to change THP policy for a set of >>> processes running in a container (under a cgroup)? >> >> Hi Zi, >> >> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always >> for processes in a cgroup is in the same way we enable KSM for the cgroup. >> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS >> in exec-invoke. >> This is at the start of the process, but you would already know at the start of the process >> whether you want THP=always for it or not. >> >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > You also need to add a new systemd.directives, e.g., MemoryTHP, to > pass the THP enablement or disablement info from a systemd config file. > And if you find those processes do not benefit from using THPs, > you can just change the new "MemoryTHP" config and restart the processes. > > Am I getting it? Thanks. > Yes, thats right. They would exactly the same as what we (Meta) do for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, the ExecContext->memory_thp would be set similar to memory_ksm [2], and when that is set, the prctl will be called at exec_invoke of the process [3]. The systemd changes should be quite simple to do. [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 >>> >>> Yafang mentioned that the prctl approach would require restarting all running >>> services[1] and other inflexiblities, so he proposed to use BPF to change THP >>> policy[2]. I wonder if Yafang's issues also apply to your case and if you >>> have a solution to them. >>> >>> Thanks. >>> >>> [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/ >>> [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/ >>>> >>>> >>>> The output and code of test program is below: >>>> >>>> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled >>>> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >>>> [root@vm4 vmuser]# ./a.out >>>> Default THP setting: >>>> THP is not set to 'always'. >>>> PR_SET_THP_ALWAYS = 1 >>>> THP is set to 'always'. >>>> PR_SET_THP_ALWAYS = 0 >>>> THP is not set to 'always'. >>>> >>>> >>>> #include <stdio.h> >>>> #include <stdlib.h> >>>> #include <string.h> >>>> #include <unistd.h> >>>> #include <sys/mman.h> >>>> #include <sys/prctl.h> >>>> >>>> #define PR_SET_THP_ALWAYS 78 >>>> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB >>>> >>>> void check_smaps(void) { >>>> FILE *file = fopen("/proc/self/smaps", "r"); >>>> if (!file) { >>>> perror("fopen"); >>>> return; >>>> } >>>> >>>> char line[256]; >>>> int is_hugepage = 0; >>>> while (fgets(line, sizeof(line), file)) { >>>> // if (strstr(line, "AnonHugePages:")) >>>> // printf("%s\n", line); >>>> if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) >>>> { >>>> // printf("%s\n", line); >>>> is_hugepage = 1; >>>> break; >>>> } >>>> } >>>> fclose(file); >>>> if (is_hugepage) { >>>> printf("THP is set to 'always'.\n"); >>>> } else { >>>> printf("THP is not set to 'always'.\n"); >>>> } >>>> } >>>> >>>> void test_mmap_thp(void) { >>>> char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >>>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >>>> if (buffer == MAP_FAILED) { >>>> perror("mmap"); >>>> return; >>>> } >>>> // Touch the memory to ensure it's allocated >>>> memset(buffer, 0, SIZE); >>>> check_smaps(); >>>> munmap(buffer, SIZE); >>>> } >>>> >>>> int main() { >>>> printf("Default THP setting: \n"); >>>> test_mmap_thp(); >>>> printf("PR_SET_THP_ALWAYS = 1 \n"); >>>> prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL); >>>> test_mmap_thp(); >>>> printf("PR_SET_THP_ALWAYS = 0 \n"); >>>> prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL); >>>> test_mmap_thp(); >>>> >>>> return 0; >>>> } >>>> >>>> >>>> Usama Arif (1): >>>> prctl: allow overriding system THP policy to always per process >>>> >>>> include/linux/huge_mm.h | 3 ++- >>>> include/linux/mm_types.h | 7 ++----- >>>> include/uapi/linux/prctl.h | 3 +++ >>>> kernel/sys.c | 16 ++++++++++++++++ >>>> tools/include/uapi/linux/prctl.h | 3 +++ >>>> .../perf/trace/beauty/include/uapi/linux/prctl.h | 3 +++ >>>> 6 files changed, 29 insertions(+), 6 deletions(-) >>>> >>>> -- >>>> 2.47.1 >>> >>> >>> -- >>> Best Regards, >>> Yan, Zi > > > -- > Best Regards, > Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-07 16:09 ` Usama Arif @ 2025-05-08 5:41 ` Yafang Shao 2025-05-08 16:04 ` Usama Arif 0 siblings, 1 reply; 29+ messages in thread From: Yafang Shao @ 2025-05-08 5:41 UTC (permalink / raw) To: Usama Arif Cc: Zi Yan, Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 07/05/2025 16:57, Zi Yan wrote: > > On 7 May 2025, at 11:12, Usama Arif wrote: > > > >> On 07/05/2025 15:57, Zi Yan wrote: > >>> +Yafang, who is also looking at changing THP config at cgroup/container level. Thanks > >>> > >>> On 7 May 2025, at 10:00, Usama Arif wrote: > >>> > >>>> Allowing override of global THP policy per process allows workloads > >>>> that have shown to benefit from hugepages to do so, without regressing > >>>> workloads that wouldn't benefit. This will allow such types of > >>>> workloads to be run/stacked on the same machine. > >>>> > >>>> It also helps in rolling out hugepages in hyperscaler configurations > >>>> for workloads that benefit from them, where a single THP policy is > >>>> likely to be used across the entire fleet, and prctl will help override it. > >>>> > >>>> An advantage of doing it via prctl vs creating a cgroup specific > >>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > >>>> that this will work even when there are no cgroups present, and my > >>>> understanding is there is a strong preference of cgroups controls being > >>>> hierarchical which usually means them having a numerical value. > >>> > >>> Hi Usama, > >>> > >>> Do you mind giving an example on how to change THP policy for a set of > >>> processes running in a container (under a cgroup)? > >> > >> Hi Zi, > >> > >> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > >> for processes in a cgroup is in the same way we enable KSM for the cgroup. > >> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > >> in exec-invoke. > >> This is at the start of the process, but you would already know at the start of the process > >> whether you want THP=always for it or not. > >> > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > > You also need to add a new systemd.directives, e.g., MemoryTHP, to > > pass the THP enablement or disablement info from a systemd config file. > > And if you find those processes do not benefit from using THPs, > > you can just change the new "MemoryTHP" config and restart the processes. > > > > Am I getting it? Thanks. > > > > Yes, thats right. They would exactly the same as what we (Meta) do > for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > that is set, the prctl will be called at exec_invoke of the process [3]. > > The systemd changes should be quite simple to do. > > [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 This solution carries a risk: since prctl() does not require any capabilities, the task itself could call it and override your memory policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that capability is typically enabled by default in containers, leaving them still vulnerable. This approach might work for Kubernetes/container environments, but it would require substantial code changes to implement securely. -- Regards Yafang ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-08 5:41 ` Yafang Shao @ 2025-05-08 16:04 ` Usama Arif 2025-05-09 2:15 ` Yafang Shao 0 siblings, 1 reply; 29+ messages in thread From: Usama Arif @ 2025-05-08 16:04 UTC (permalink / raw) To: Yafang Shao Cc: Zi Yan, Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 08/05/2025 06:41, Yafang Shao wrote: > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 07/05/2025 16:57, Zi Yan wrote: >>> On 7 May 2025, at 11:12, Usama Arif wrote: >>> >>>> On 07/05/2025 15:57, Zi Yan wrote: >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > > Thanks > >>>>> >>>>> On 7 May 2025, at 10:00, Usama Arif wrote: >>>>> >>>>>> Allowing override of global THP policy per process allows workloads >>>>>> that have shown to benefit from hugepages to do so, without regressing >>>>>> workloads that wouldn't benefit. This will allow such types of >>>>>> workloads to be run/stacked on the same machine. >>>>>> >>>>>> It also helps in rolling out hugepages in hyperscaler configurations >>>>>> for workloads that benefit from them, where a single THP policy is >>>>>> likely to be used across the entire fleet, and prctl will help override it. >>>>>> >>>>>> An advantage of doing it via prctl vs creating a cgroup specific >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is >>>>>> that this will work even when there are no cgroups present, and my >>>>>> understanding is there is a strong preference of cgroups controls being >>>>>> hierarchical which usually means them having a numerical value. >>>>> >>>>> Hi Usama, >>>>> >>>>> Do you mind giving an example on how to change THP policy for a set of >>>>> processes running in a container (under a cgroup)? >>>> >>>> Hi Zi, >>>> >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS >>>> in exec-invoke. >>>> This is at the start of the process, but you would already know at the start of the process >>>> whether you want THP=always for it or not. >>>> >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 >>> >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to >>> pass the THP enablement or disablement info from a systemd config file. >>> And if you find those processes do not benefit from using THPs, >>> you can just change the new "MemoryTHP" config and restart the processes. >>> >>> Am I getting it? Thanks. >>> >> >> Yes, thats right. They would exactly the same as what we (Meta) do >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when >> that is set, the prctl will be called at exec_invoke of the process [3]. >> >> The systemd changes should be quite simple to do. >> >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > This solution carries a risk: since prctl() does not require any > capabilities, the task itself could call it and override your memory > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > capability is typically enabled by default in containers, leaving them > still vulnerable. > > This approach might work for Kubernetes/container environments, but it > would require substantial code changes to implement securely. > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE already exists and the someone could use this to slow the process down. So the approach this patch takes shouldn't be anymore of a security fix then what is already exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE should be used to restrict this. In terms of security vulnerability of prctl, I feel like there are a lot of others that can be a much much bigger issue? I just had a look and you can change the seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security argument would be valid. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-08 16:04 ` Usama Arif @ 2025-05-09 2:15 ` Yafang Shao 2025-05-09 5:13 ` Johannes Weiner 0 siblings, 1 reply; 29+ messages in thread From: Yafang Shao @ 2025-05-09 2:15 UTC (permalink / raw) To: Usama Arif Cc: Zi Yan, Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 08/05/2025 06:41, Yafang Shao wrote: > > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 07/05/2025 16:57, Zi Yan wrote: > >>> On 7 May 2025, at 11:12, Usama Arif wrote: > >>> > >>>> On 07/05/2025 15:57, Zi Yan wrote: > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > > > > Thanks > > > >>>>> > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote: > >>>>> > >>>>>> Allowing override of global THP policy per process allows workloads > >>>>>> that have shown to benefit from hugepages to do so, without regressing > >>>>>> workloads that wouldn't benefit. This will allow such types of > >>>>>> workloads to be run/stacked on the same machine. > >>>>>> > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations > >>>>>> for workloads that benefit from them, where a single THP policy is > >>>>>> likely to be used across the entire fleet, and prctl will help override it. > >>>>>> > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > >>>>>> that this will work even when there are no cgroups present, and my > >>>>>> understanding is there is a strong preference of cgroups controls being > >>>>>> hierarchical which usually means them having a numerical value. > >>>>> > >>>>> Hi Usama, > >>>>> > >>>>> Do you mind giving an example on how to change THP policy for a set of > >>>>> processes running in a container (under a cgroup)? > >>>> > >>>> Hi Zi, > >>>> > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > >>>> in exec-invoke. > >>>> This is at the start of the process, but you would already know at the start of the process > >>>> whether you want THP=always for it or not. > >>>> > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > >>> > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to > >>> pass the THP enablement or disablement info from a systemd config file. > >>> And if you find those processes do not benefit from using THPs, > >>> you can just change the new "MemoryTHP" config and restart the processes. > >>> > >>> Am I getting it? Thanks. > >>> > >> > >> Yes, thats right. They would exactly the same as what we (Meta) do > >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > >> that is set, the prctl will be called at exec_invoke of the process [3]. > >> > >> The systemd changes should be quite simple to do. > >> > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > > This solution carries a risk: since prctl() does not require any > > capabilities, the task itself could call it and override your memory > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > > capability is typically enabled by default in containers, leaving them > > still vulnerable. > > > > This approach might work for Kubernetes/container environments, but it > > would require substantial code changes to implement securely. > > > > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE > already exists and the someone could use this to slow the process down. So the > approach this patch takes shouldn't be anymore of a security fix then what is already > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE > should be used to restrict this. I believe we should at least require CAP_SYS_RESOURCE to enable THP, since it overrides global system settings. Alternatively, CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely certain. > > In terms of security vulnerability of prctl, I feel like there are a lot of others > that can be a much much bigger issue? I just had a look and you can change the > seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security > argument would be valid. I was surprised to discover that none of these operations require any capabilities to execute. -- Regards Yafang ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 2:15 ` Yafang Shao @ 2025-05-09 5:13 ` Johannes Weiner 2025-05-09 9:24 ` Yafang Shao 0 siblings, 1 reply; 29+ messages in thread From: Johannes Weiner @ 2025-05-09 5:13 UTC (permalink / raw) To: Yafang Shao Cc: Usama Arif, Zi Yan, Andrew Morton, david, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote: > On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > On 08/05/2025 06:41, Yafang Shao wrote: > > > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > > >> > > >> > > >> > > >> On 07/05/2025 16:57, Zi Yan wrote: > > >>> On 7 May 2025, at 11:12, Usama Arif wrote: > > >>> > > >>>> On 07/05/2025 15:57, Zi Yan wrote: > > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > > > > > > Thanks > > > > > >>>>> > > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote: > > >>>>> > > >>>>>> Allowing override of global THP policy per process allows workloads > > >>>>>> that have shown to benefit from hugepages to do so, without regressing > > >>>>>> workloads that wouldn't benefit. This will allow such types of > > >>>>>> workloads to be run/stacked on the same machine. > > >>>>>> > > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations > > >>>>>> for workloads that benefit from them, where a single THP policy is > > >>>>>> likely to be used across the entire fleet, and prctl will help override it. > > >>>>>> > > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific > > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > > >>>>>> that this will work even when there are no cgroups present, and my > > >>>>>> understanding is there is a strong preference of cgroups controls being > > >>>>>> hierarchical which usually means them having a numerical value. > > >>>>> > > >>>>> Hi Usama, > > >>>>> > > >>>>> Do you mind giving an example on how to change THP policy for a set of > > >>>>> processes running in a container (under a cgroup)? > > >>>> > > >>>> Hi Zi, > > >>>> > > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. > > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > > >>>> in exec-invoke. > > >>>> This is at the start of the process, but you would already know at the start of the process > > >>>> whether you want THP=always for it or not. > > >>>> > > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > >>> > > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to > > >>> pass the THP enablement or disablement info from a systemd config file. > > >>> And if you find those processes do not benefit from using THPs, > > >>> you can just change the new "MemoryTHP" config and restart the processes. > > >>> > > >>> Am I getting it? Thanks. > > >>> > > >> > > >> Yes, thats right. They would exactly the same as what we (Meta) do > > >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > > >> that is set, the prctl will be called at exec_invoke of the process [3]. > > >> > > >> The systemd changes should be quite simple to do. > > >> > > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > > > > This solution carries a risk: since prctl() does not require any > > > capabilities, the task itself could call it and override your memory > > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > > > capability is typically enabled by default in containers, leaving them > > > still vulnerable. > > > > > > This approach might work for Kubernetes/container environments, but it > > > would require substantial code changes to implement securely. > > > > > > > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE > > already exists and the someone could use this to slow the process down. So the > > approach this patch takes shouldn't be anymore of a security fix then what is already > > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE > > should be used to restrict this. > > I believe we should at least require CAP_SYS_RESOURCE to enable THP, > since it overrides global system settings. Alternatively, > CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely > certain. Hm, could you verbalize a concrete security concern? I've never really looked at the global settings as a hard policy, more as picking a default for the workloads in the system. It's usually `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long existed to give applications the ability to refine the global choice. The prctl should probably respect `never' for consistency, but beyond that I don't really see the concern, or how this would allow something that isn't already possible. > > In terms of security vulnerability of prctl, I feel like there are a lot of others > > that can be a much much bigger issue? I just had a look and you can change the > > seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security > > argument would be valid. > > I was surprised to discover that none of these operations require any > capabilities to execute. seccomp enabling is a one-way street, PR_SPEC_FORCE_DISABLE is as well. You can reset PAC keys, but presumably, unless you also switch to a new execution context with entirely new PAC/AUT pairs, this would just crash the application on the next AUT? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 5:13 ` Johannes Weiner @ 2025-05-09 9:24 ` Yafang Shao 2025-05-09 9:30 ` David Hildenbrand 0 siblings, 1 reply; 29+ messages in thread From: Yafang Shao @ 2025-05-09 9:24 UTC (permalink / raw) To: Johannes Weiner Cc: Usama Arif, Zi Yan, Andrew Morton, david, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote: > > On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > > > > > On 08/05/2025 06:41, Yafang Shao wrote: > > > > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > >> > > > >> > > > >> > > > >> On 07/05/2025 16:57, Zi Yan wrote: > > > >>> On 7 May 2025, at 11:12, Usama Arif wrote: > > > >>> > > > >>>> On 07/05/2025 15:57, Zi Yan wrote: > > > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > > > > > > > > Thanks > > > > > > > >>>>> > > > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote: > > > >>>>> > > > >>>>>> Allowing override of global THP policy per process allows workloads > > > >>>>>> that have shown to benefit from hugepages to do so, without regressing > > > >>>>>> workloads that wouldn't benefit. This will allow such types of > > > >>>>>> workloads to be run/stacked on the same machine. > > > >>>>>> > > > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations > > > >>>>>> for workloads that benefit from them, where a single THP policy is > > > >>>>>> likely to be used across the entire fleet, and prctl will help override it. > > > >>>>>> > > > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific > > > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > > > >>>>>> that this will work even when there are no cgroups present, and my > > > >>>>>> understanding is there is a strong preference of cgroups controls being > > > >>>>>> hierarchical which usually means them having a numerical value. > > > >>>>> > > > >>>>> Hi Usama, > > > >>>>> > > > >>>>> Do you mind giving an example on how to change THP policy for a set of > > > >>>>> processes running in a container (under a cgroup)? > > > >>>> > > > >>>> Hi Zi, > > > >>>> > > > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > > > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. > > > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > > > >>>> in exec-invoke. > > > >>>> This is at the start of the process, but you would already know at the start of the process > > > >>>> whether you want THP=always for it or not. > > > >>>> > > > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > >>> > > > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to > > > >>> pass the THP enablement or disablement info from a systemd config file. > > > >>> And if you find those processes do not benefit from using THPs, > > > >>> you can just change the new "MemoryTHP" config and restart the processes. > > > >>> > > > >>> Am I getting it? Thanks. > > > >>> > > > >> > > > >> Yes, thats right. They would exactly the same as what we (Meta) do > > > >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > > > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > > > >> that is set, the prctl will be called at exec_invoke of the process [3]. > > > >> > > > >> The systemd changes should be quite simple to do. > > > >> > > > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > > > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > > > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > > > > > > This solution carries a risk: since prctl() does not require any > > > > capabilities, the task itself could call it and override your memory > > > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > > > > capability is typically enabled by default in containers, leaving them > > > > still vulnerable. > > > > > > > > This approach might work for Kubernetes/container environments, but it > > > > would require substantial code changes to implement securely. > > > > > > > > > > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE > > > already exists and the someone could use this to slow the process down. So the > > > approach this patch takes shouldn't be anymore of a security fix then what is already > > > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE > > > should be used to restrict this. > > > > I believe we should at least require CAP_SYS_RESOURCE to enable THP, > > since it overrides global system settings. Alternatively, > > CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely > > certain. > > Hm, could you verbalize a concrete security concern? > > I've never really looked at the global settings as a hard policy, more > as picking a default for the workloads in the system. It's usually > `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long > existed to give applications the ability to refine the global choice. > > The prctl should probably respect `never' for consistency, but beyond > that I don't really see the concern, or how this would allow something > that isn't already possible. I would interpret the always, madvise, and never options as follows: - always The sysadmin strongly recommends using THP. If a user does not want to use it, they must explicitly disable it. - madvise The sysadmin gently encourages the use of THP, but it is only enabled when explicitly requested by the application. - never The sysadmin discourages the use of THP, and "its use is only permitted with explicit approval" . > > > > In terms of security vulnerability of prctl, I feel like there are a lot of others > > > that can be a much much bigger issue? I just had a look and you can change the > > > seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security > > > argument would be valid. > > > > I was surprised to discover that none of these operations require any > > capabilities to execute. > > seccomp enabling is a one-way street, PR_SPEC_FORCE_DISABLE is as > well. You can reset PAC keys, but presumably, unless you also switch > to a new execution context with entirely new PAC/AUT pairs, this would > just crash the application on the next AUT? It appears so—thank you for the clarification. -- Regards Yafang ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 9:24 ` Yafang Shao @ 2025-05-09 9:30 ` David Hildenbrand 2025-05-09 9:43 ` Yafang Shao 0 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand @ 2025-05-09 9:30 UTC (permalink / raw) To: Yafang Shao, Johannes Weiner Cc: Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 09.05.25 11:24, Yafang Shao wrote: > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote: >> >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote: >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: >>>> >>>> >>>> >>>> On 08/05/2025 06:41, Yafang Shao wrote: >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 07/05/2025 16:57, Zi Yan wrote: >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote: >>>>>>> >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote: >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. >>>>> >>>>> Thanks >>>>> >>>>>>>>> >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote: >>>>>>>>> >>>>>>>>>> Allowing override of global THP policy per process allows workloads >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of >>>>>>>>>> workloads to be run/stacked on the same machine. >>>>>>>>>> >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations >>>>>>>>>> for workloads that benefit from them, where a single THP policy is >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it. >>>>>>>>>> >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is >>>>>>>>>> that this will work even when there are no cgroups present, and my >>>>>>>>>> understanding is there is a strong preference of cgroups controls being >>>>>>>>>> hierarchical which usually means them having a numerical value. >>>>>>>>> >>>>>>>>> Hi Usama, >>>>>>>>> >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of >>>>>>>>> processes running in a container (under a cgroup)? >>>>>>>> >>>>>>>> Hi Zi, >>>>>>>> >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS >>>>>>>> in exec-invoke. >>>>>>>> This is at the start of the process, but you would already know at the start of the process >>>>>>>> whether you want THP=always for it or not. >>>>>>>> >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 >>>>>>> >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to >>>>>>> pass the THP enablement or disablement info from a systemd config file. >>>>>>> And if you find those processes do not benefit from using THPs, >>>>>>> you can just change the new "MemoryTHP" config and restart the processes. >>>>>>> >>>>>>> Am I getting it? Thanks. >>>>>>> >>>>>> >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when >>>>>> that is set, the prctl will be called at exec_invoke of the process [3]. >>>>>> >>>>>> The systemd changes should be quite simple to do. >>>>>> >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 >>>>> >>>>> This solution carries a risk: since prctl() does not require any >>>>> capabilities, the task itself could call it and override your memory >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that >>>>> capability is typically enabled by default in containers, leaving them >>>>> still vulnerable. >>>>> >>>>> This approach might work for Kubernetes/container environments, but it >>>>> would require substantial code changes to implement securely. >>>>> >>>> >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE >>>> already exists and the someone could use this to slow the process down. So the >>>> approach this patch takes shouldn't be anymore of a security fix then what is already >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE >>>> should be used to restrict this. >>> >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP, >>> since it overrides global system settings. Alternatively, >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely >>> certain. >> >> Hm, could you verbalize a concrete security concern? >> >> I've never really looked at the global settings as a hard policy, more >> as picking a default for the workloads in the system. It's usually >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long >> existed to give applications the ability to refine the global choice. >> >> The prctl should probably respect `never' for consistency, but beyond >> that I don't really see the concern, or how this would allow something >> that isn't already possible. > > I would interpret the always, madvise, and never options as follows: > - always > The sysadmin strongly recommends using THP. If a user does not > want to use it, they must explicitly disable it. > - madvise > The sysadmin gently encourages the use of THP, but it is only > enabled when explicitly requested by the application. > - never > The sysadmin discourages the use of THP, and "its use is only permitted > with explicit approval" . "never" so far means "no thps, no exceptions". We've had serious THP issues in the past, where our workaround until we sorted out the issue for affected customers was to force-disable THPs on that system during boot. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 9:30 ` David Hildenbrand @ 2025-05-09 9:43 ` Yafang Shao 2025-05-09 16:46 ` Johannes Weiner 0 siblings, 1 reply; 29+ messages in thread From: Yafang Shao @ 2025-05-09 9:43 UTC (permalink / raw) To: David Hildenbrand Cc: Johannes Weiner, Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@redhat.com> wrote: > > On 09.05.25 11:24, Yafang Shao wrote: > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > >> > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote: > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>> > >>>> > >>>> > >>>> On 08/05/2025 06:41, Yafang Shao wrote: > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 07/05/2025 16:57, Zi Yan wrote: > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote: > >>>>>>> > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote: > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > >>>>> > >>>>> Thanks > >>>>> > >>>>>>>>> > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote: > >>>>>>>>> > >>>>>>>>>> Allowing override of global THP policy per process allows workloads > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing > >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of > >>>>>>>>>> workloads to be run/stacked on the same machine. > >>>>>>>>>> > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it. > >>>>>>>>>> > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > >>>>>>>>>> that this will work even when there are no cgroups present, and my > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being > >>>>>>>>>> hierarchical which usually means them having a numerical value. > >>>>>>>>> > >>>>>>>>> Hi Usama, > >>>>>>>>> > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of > >>>>>>>>> processes running in a container (under a cgroup)? > >>>>>>>> > >>>>>>>> Hi Zi, > >>>>>>>> > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > >>>>>>>> in exec-invoke. > >>>>>>>> This is at the start of the process, but you would already know at the start of the process > >>>>>>>> whether you want THP=always for it or not. > >>>>>>>> > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > >>>>>>> > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to > >>>>>>> pass the THP enablement or disablement info from a systemd config file. > >>>>>>> And if you find those processes do not benefit from using THPs, > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes. > >>>>>>> > >>>>>>> Am I getting it? Thanks. > >>>>>>> > >>>>>> > >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do > >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3]. > >>>>>> > >>>>>> The systemd changes should be quite simple to do. > >>>>>> > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > >>>>> > >>>>> This solution carries a risk: since prctl() does not require any > >>>>> capabilities, the task itself could call it and override your memory > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > >>>>> capability is typically enabled by default in containers, leaving them > >>>>> still vulnerable. > >>>>> > >>>>> This approach might work for Kubernetes/container environments, but it > >>>>> would require substantial code changes to implement securely. > >>>>> > >>>> > >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE > >>>> already exists and the someone could use this to slow the process down. So the > >>>> approach this patch takes shouldn't be anymore of a security fix then what is already > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE > >>>> should be used to restrict this. > >>> > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP, > >>> since it overrides global system settings. Alternatively, > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely > >>> certain. > >> > >> Hm, could you verbalize a concrete security concern? > >> > >> I've never really looked at the global settings as a hard policy, more > >> as picking a default for the workloads in the system. It's usually > >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long > >> existed to give applications the ability to refine the global choice. > >> > >> The prctl should probably respect `never' for consistency, but beyond > >> that I don't really see the concern, or how this would allow something > >> that isn't already possible. > > > > I would interpret the always, madvise, and never options as follows: > > - always > > The sysadmin strongly recommends using THP. If a user does not > > want to use it, they must explicitly disable it. > > - madvise > > The sysadmin gently encourages the use of THP, but it is only > > enabled when explicitly requested by the application. > > - never > > The sysadmin discourages the use of THP, and "its use is only permitted > > with explicit approval" . > > "never" so far means "no thps, no exceptions". We've had serious THP > issues in the past, where our workaround until we sorted out the issue > for affected customers was to force-disable THPs on that system during boot. Right, that reflects the current behavior. What we aim to enhance is by adding the requirement that "its use is only permitted with explicit approval." This would be a small but meaningful step toward making THP truly usable in a transparent way in production environments. -- Regards Yafang ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 9:43 ` Yafang Shao @ 2025-05-09 16:46 ` Johannes Weiner 2025-05-09 22:42 ` David Hildenbrand 2025-05-11 2:08 ` Yafang Shao 0 siblings, 2 replies; 29+ messages in thread From: Johannes Weiner @ 2025-05-09 16:46 UTC (permalink / raw) To: Yafang Shao Cc: David Hildenbrand, Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Fri, May 09, 2025 at 05:43:10PM +0800, Yafang Shao wrote: > On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > On 09.05.25 11:24, Yafang Shao wrote: > > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > >> > > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote: > > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > >>>> > > >>>> > > >>>> > > >>>> On 08/05/2025 06:41, Yafang Shao wrote: > > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> On 07/05/2025 16:57, Zi Yan wrote: > > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote: > > >>>>>>> > > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote: > > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > > >>>>> > > >>>>> Thanks > > >>>>> > > >>>>>>>>> > > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote: > > >>>>>>>>> > > >>>>>>>>>> Allowing override of global THP policy per process allows workloads > > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing > > >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of > > >>>>>>>>>> workloads to be run/stacked on the same machine. > > >>>>>>>>>> > > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations > > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is > > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it. > > >>>>>>>>>> > > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific > > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > > >>>>>>>>>> that this will work even when there are no cgroups present, and my > > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being > > >>>>>>>>>> hierarchical which usually means them having a numerical value. > > >>>>>>>>> > > >>>>>>>>> Hi Usama, > > >>>>>>>>> > > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of > > >>>>>>>>> processes running in a container (under a cgroup)? > > >>>>>>>> > > >>>>>>>> Hi Zi, > > >>>>>>>> > > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. > > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > > >>>>>>>> in exec-invoke. > > >>>>>>>> This is at the start of the process, but you would already know at the start of the process > > >>>>>>>> whether you want THP=always for it or not. > > >>>>>>>> > > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > >>>>>>> > > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to > > >>>>>>> pass the THP enablement or disablement info from a systemd config file. > > >>>>>>> And if you find those processes do not benefit from using THPs, > > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes. > > >>>>>>> > > >>>>>>> Am I getting it? Thanks. > > >>>>>>> > > >>>>>> > > >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do > > >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3]. > > >>>>>> > > >>>>>> The systemd changes should be quite simple to do. > > >>>>>> > > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > >>>>> > > >>>>> This solution carries a risk: since prctl() does not require any > > >>>>> capabilities, the task itself could call it and override your memory > > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > > >>>>> capability is typically enabled by default in containers, leaving them > > >>>>> still vulnerable. > > >>>>> > > >>>>> This approach might work for Kubernetes/container environments, but it > > >>>>> would require substantial code changes to implement securely. > > >>>>> > > >>>> > > >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE > > >>>> already exists and the someone could use this to slow the process down. So the > > >>>> approach this patch takes shouldn't be anymore of a security fix then what is already > > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE > > >>>> should be used to restrict this. > > >>> > > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP, > > >>> since it overrides global system settings. Alternatively, > > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely > > >>> certain. > > >> > > >> Hm, could you verbalize a concrete security concern? > > >> > > >> I've never really looked at the global settings as a hard policy, more > > >> as picking a default for the workloads in the system. It's usually > > >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long > > >> existed to give applications the ability to refine the global choice. > > >> > > >> The prctl should probably respect `never' for consistency, but beyond > > >> that I don't really see the concern, or how this would allow something > > >> that isn't already possible. > > > > > > I would interpret the always, madvise, and never options as follows: > > > - always > > > The sysadmin strongly recommends using THP. If a user does not > > > want to use it, they must explicitly disable it. I would call this "kernel mode" or "auto mode", where userspace should *generally* not have to worry about huge pages, but with an option for declaring the odd exceptional case. Both madvise() and unprivileged prctl() currently work, and IMO should continue to work, for declaring exceptions. > > > - madvise > > > The sysadmin gently encourages the use of THP, but it is only > > > enabled when explicitly requested by the application. And this "user mode" or "manual mode", where applications self-manage which parts of userspace they want to enroll. Both madvise() and unprivileged prctl() should work here as well, IMO. There is no policy or security difference between them, it's just about granularity and usability. > > > - never > > > The sysadmin discourages the use of THP, and "its use is only permitted > > > with explicit approval" . This one I don't quite agree with, and IMO conflicts with what David is saying as well. > > "never" so far means "no thps, no exceptions". We've had serious THP > > issues in the past, where our workaround until we sorted out the issue > > for affected customers was to force-disable THPs on that system during boot. > > Right, that reflects the current behavior. What we aim to enhance is > by adding the requirement that "its use is only permitted with > explicit approval." I think you're conflating a safety issue with a security issue. David is saying there can be cases where the kernel is broken, and "never" is a production escape hatch to disable the feature until a kernel upgrade for the fix is possible. In such a case, it doesn't make sense to override this decision based on any sort of workload policy, privileged or not. The way I understand you is that you want enrollment (and/or self-management) only for blessed applications. Because you don't generally trust workloads in the wild enough to switch the global default away from "never", given the semantics of always/madvise. To me this sounds like you'd need a different mode, call it "blessed"; with a privileged interface to control which applications are allowed to madvise/prctl-enable. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 16:46 ` Johannes Weiner @ 2025-05-09 22:42 ` David Hildenbrand 2025-05-09 23:34 ` Zi Yan 2025-05-11 2:08 ` Yafang Shao 1 sibling, 1 reply; 29+ messages in thread From: David Hildenbrand @ 2025-05-09 22:42 UTC (permalink / raw) To: Johannes Weiner, Yafang Shao Cc: Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team >>>> - madvise >>>> The sysadmin gently encourages the use of THP, but it is only >>>> enabled when explicitly requested by the application. > > And this "user mode" or "manual mode", where applications self-manage > which parts of userspace they want to enroll. > > Both madvise() and unprivileged prctl() should work here as well, > IMO. There is no policy or security difference between them, it's just > about granularity and usability. > >>>> - never >>>> The sysadmin discourages the use of THP, and "its use is only permitted >>>> with explicit approval" . > > This one I don't quite agree with, and IMO conflicts with what David > is saying as well. Yeah ... "never" does not mean "sometimes" in my reality :) > >>> "never" so far means "no thps, no exceptions". We've had serious THP >>> issues in the past, where our workaround until we sorted out the issue >>> for affected customers was to force-disable THPs on that system during boot. >> >> Right, that reflects the current behavior. What we aim to enhance is >> by adding the requirement that "its use is only permitted with >> explicit approval." > > I think you're conflating a safety issue with a security issue. > > David is saying there can be cases where the kernel is broken, and > "never" is a production escape hatch to disable the feature until a > kernel upgrade for the fix is possible. In such a case, it doesn't > make sense to override this decision based on any sort of workload > policy, privileged or not. > > The way I understand you is that you want enrollment (and/or > self-management) only for blessed applications. Because you don't > generally trust workloads in the wild enough to switch the global > default away from "never", given the semantics of always/madvise. Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control? We'd need a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work. b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 22:42 ` David Hildenbrand @ 2025-05-09 23:34 ` Zi Yan 2025-05-11 8:15 ` David Hildenbrand 0 siblings, 1 reply; 29+ messages in thread From: Zi Yan @ 2025-05-09 23:34 UTC (permalink / raw) To: David Hildenbrand Cc: Johannes Weiner, Yafang Shao, Usama Arif, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 9 May 2025, at 18:42, David Hildenbrand wrote: >>>>> - madvise >>>>> The sysadmin gently encourages the use of THP, but it is only >>>>> enabled when explicitly requested by the application. >> >> And this "user mode" or "manual mode", where applications self-manage >> which parts of userspace they want to enroll. >> >> Both madvise() and unprivileged prctl() should work here as well, >> IMO. There is no policy or security difference between them, it's just >> about granularity and usability. >> >>>>> - never >>>>> The sysadmin discourages the use of THP, and "its use is only permitted >>>>> with explicit approval" . >> >> This one I don't quite agree with, and IMO conflicts with what David >> is saying as well. > > Yeah ... "never" does not mean "sometimes" in my reality :) > >> >>>> "never" so far means "no thps, no exceptions". We've had serious THP >>>> issues in the past, where our workaround until we sorted out the issue >>>> for affected customers was to force-disable THPs on that system during boot. >>> >>> Right, that reflects the current behavior. What we aim to enhance is >>> by adding the requirement that "its use is only permitted with >>> explicit approval." >> >> I think you're conflating a safety issue with a security issue. >> >> David is saying there can be cases where the kernel is broken, and >> "never" is a production escape hatch to disable the feature until a >> kernel upgrade for the fix is possible. In such a case, it doesn't >> make sense to override this decision based on any sort of workload >> policy, privileged or not. >> >> The way I understand you is that you want enrollment (and/or >> self-management) only for blessed applications. Because you don't >> generally trust workloads in the wild enough to switch the global >> default away from "never", given the semantics of always/madvise. > > Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control? > > We'd need > > a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work. > > b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work? This means process level control overrides VMA level control, which overrides global control, right? Intuitively, it should be that VMA level control overrides process level control, which overrides global control, namely finer granularly control precedes coarse one. But some apps might not use VMA level control (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA level control is what we want? -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 23:34 ` Zi Yan @ 2025-05-11 8:15 ` David Hildenbrand 2025-05-11 14:08 ` Usama Arif 0 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand @ 2025-05-11 8:15 UTC (permalink / raw) To: Zi Yan Cc: Johannes Weiner, Yafang Shao, Usama Arif, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 10.05.25 01:34, Zi Yan wrote: > On 9 May 2025, at 18:42, David Hildenbrand wrote: > >>>>>> - madvise >>>>>> The sysadmin gently encourages the use of THP, but it is only >>>>>> enabled when explicitly requested by the application. >>> >>> And this "user mode" or "manual mode", where applications self-manage >>> which parts of userspace they want to enroll. >>> >>> Both madvise() and unprivileged prctl() should work here as well, >>> IMO. There is no policy or security difference between them, it's just >>> about granularity and usability. >>> >>>>>> - never >>>>>> The sysadmin discourages the use of THP, and "its use is only permitted >>>>>> with explicit approval" . >>> >>> This one I don't quite agree with, and IMO conflicts with what David >>> is saying as well. >> >> Yeah ... "never" does not mean "sometimes" in my reality :) >> >>> >>>>> "never" so far means "no thps, no exceptions". We've had serious THP >>>>> issues in the past, where our workaround until we sorted out the issue >>>>> for affected customers was to force-disable THPs on that system during boot. >>>> >>>> Right, that reflects the current behavior. What we aim to enhance is >>>> by adding the requirement that "its use is only permitted with >>>> explicit approval." >>> >>> I think you're conflating a safety issue with a security issue. >>> >>> David is saying there can be cases where the kernel is broken, and >>> "never" is a production escape hatch to disable the feature until a >>> kernel upgrade for the fix is possible. In such a case, it doesn't >>> make sense to override this decision based on any sort of workload >>> policy, privileged or not. >>> >>> The way I understand you is that you want enrollment (and/or >>> self-management) only for blessed applications. Because you don't >>> generally trust workloads in the wild enough to switch the global >>> default away from "never", given the semantics of always/madvise. >> >> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control? >> >> We'd need >> >> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work. >> >> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work? > > This means process level control overrides VMA level control, which > overrides global control, right? > > Intuitively, it should be that VMA level control overrides process level > control, which overrides global control, namely finer granularly control > precedes coarse one. But some apps might not use VMA level control > (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA > level control is what we want? Let's take a step back: Current behavior is 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never". 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA. 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it. Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today. That commit notes: "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously." Whatever we do, we have to be careful to not create more mess or inconsistency. Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-11 8:15 ` David Hildenbrand @ 2025-05-11 14:08 ` Usama Arif 2025-05-13 11:43 ` Yafang Shao 2025-05-13 12:04 ` David Hildenbrand 0 siblings, 2 replies; 29+ messages in thread From: Usama Arif @ 2025-05-11 14:08 UTC (permalink / raw) To: David Hildenbrand, Zi Yan Cc: Johannes Weiner, Yafang Shao, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 11/05/2025 09:15, David Hildenbrand wrote: > On 10.05.25 01:34, Zi Yan wrote: >> On 9 May 2025, at 18:42, David Hildenbrand wrote: >> >>>>>>> - madvise >>>>>>> The sysadmin gently encourages the use of THP, but it is only >>>>>>> enabled when explicitly requested by the application. >>>> >>>> And this "user mode" or "manual mode", where applications self-manage >>>> which parts of userspace they want to enroll. >>>> >>>> Both madvise() and unprivileged prctl() should work here as well, >>>> IMO. There is no policy or security difference between them, it's just >>>> about granularity and usability. >>>> >>>>>>> - never >>>>>>> The sysadmin discourages the use of THP, and "its use is only permitted >>>>>>> with explicit approval" . >>>> >>>> This one I don't quite agree with, and IMO conflicts with what David >>>> is saying as well. >>> >>> Yeah ... "never" does not mean "sometimes" in my reality :) >>> >>>> >>>>>> "never" so far means "no thps, no exceptions". We've had serious THP >>>>>> issues in the past, where our workaround until we sorted out the issue >>>>>> for affected customers was to force-disable THPs on that system during boot. >>>>> >>>>> Right, that reflects the current behavior. What we aim to enhance is >>>>> by adding the requirement that "its use is only permitted with >>>>> explicit approval." >>>> >>>> I think you're conflating a safety issue with a security issue. >>>> >>>> David is saying there can be cases where the kernel is broken, and >>>> "never" is a production escape hatch to disable the feature until a >>>> kernel upgrade for the fix is possible. In such a case, it doesn't >>>> make sense to override this decision based on any sort of workload >>>> policy, privileged or not. >>>> >>>> The way I understand you is that you want enrollment (and/or >>>> self-management) only for blessed applications. Because you don't >>>> generally trust workloads in the wild enough to switch the global >>>> default away from "never", given the semantics of always/madvise. >>> >>> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control? >>> >>> We'd need >>> >>> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work. >>> >>> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work? >> >> This means process level control overrides VMA level control, which >> overrides global control, right? >> >> Intuitively, it should be that VMA level control overrides process level >> control, which overrides global control, namely finer granularly control >> precedes coarse one. But some apps might not use VMA level control >> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA >> level control is what we want? > > Let's take a step back: > > Current behavior is > > 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never". Just to add here to the current behavior for completeness, if we have the global system setting set to never, but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have: [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled always madvise [never] [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled always inherit [madvise] never And then MADV_HUGEPAGE some region, it still gives me a THP. > > 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA. > > 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it. > > > Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today. > > That commit notes: > > "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously." > > > Whatever we do, we have to be careful to not create more mess or inconsistency. > > Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever. > I thought I will also summarize what the real world usecases are that we want to solve: 1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs, others will incur a regression (either its a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM). So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Metas) fleet at this moment.) 2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so. (We hope this is us (meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt-out and be "madvise". New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.) 3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins dont want to switch the global policy to madvise, but are willing to accept certain processes to madvise. But David mentioned in [2] that never means no thps, no exceptions and the only way to solve some issues in the past has been to disable THPs completely. Please feel free to add to the above list. I thought it would be good to list them out so that the solution can be derived with them in mind. In terms of doing this with prctl, I was able to make prototypes for the 2 approaches that have been discussed: a) have prctl change how existing and new VMAs have VM_HUGEPAGE set for the process including after fork+exec, as proposed by David. This prototype is available at [3]. This will solve problem 1 discussed above, but I don't think this approach can be used to solve problems 2 and 3? There isnt a way where we can have a process change VMA setting so that after prctl, all future allocations are on madvise basis and not global policy (even if always). IOW we will need some change in thp_vma_allowable_orders to have it done on process level basis. b) have prctl override global policy *only* for hugepages that "inherit" global and only if global is "madvise" or "always". This prototype is available at [4]. The way I did this will achieve usecase 1 and 2, but not 3 (It can very easily be modified to get 3, but didn't do it as there maybe still is a discussion on what should be allowed when global=never?). I do prefer this method as I think it might be simpler overall and achieves both usecases. [1] https://lore.kernel.org/all/CALOAHbBAVELx-fwyoQUH_ypFvT_Zd5ZLjSkAPXxShgCua8ifpA@mail.gmail.com/ [2] https://lore.kernel.org/all/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [3] https://github.com/uarif1/linux/commit/209373cdeda93a55a699e2eee29f88f4e64ac8a5 [4] https://github.com/uarif1/linux/commit/e85c8edbcb4165c00026f0058b71e85f77da23f4 Thanks, Usama ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-11 14:08 ` Usama Arif @ 2025-05-13 11:43 ` Yafang Shao 2025-05-13 12:04 ` David Hildenbrand 1 sibling, 0 replies; 29+ messages in thread From: Yafang Shao @ 2025-05-13 11:43 UTC (permalink / raw) To: Usama Arif Cc: David Hildenbrand, Zi Yan, Johannes Weiner, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Sun, May 11, 2025 at 10:08 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 11/05/2025 09:15, David Hildenbrand wrote: > > On 10.05.25 01:34, Zi Yan wrote: > >> On 9 May 2025, at 18:42, David Hildenbrand wrote: > >> > >>>>>>> - madvise > >>>>>>> The sysadmin gently encourages the use of THP, but it is only > >>>>>>> enabled when explicitly requested by the application. > >>>> > >>>> And this "user mode" or "manual mode", where applications self-manage > >>>> which parts of userspace they want to enroll. > >>>> > >>>> Both madvise() and unprivileged prctl() should work here as well, > >>>> IMO. There is no policy or security difference between them, it's just > >>>> about granularity and usability. > >>>> > >>>>>>> - never > >>>>>>> The sysadmin discourages the use of THP, and "its use is only permitted > >>>>>>> with explicit approval" . > >>>> > >>>> This one I don't quite agree with, and IMO conflicts with what David > >>>> is saying as well. > >>> > >>> Yeah ... "never" does not mean "sometimes" in my reality :) > >>> > >>>> > >>>>>> "never" so far means "no thps, no exceptions". We've had serious THP > >>>>>> issues in the past, where our workaround until we sorted out the issue > >>>>>> for affected customers was to force-disable THPs on that system during boot. > >>>>> > >>>>> Right, that reflects the current behavior. What we aim to enhance is > >>>>> by adding the requirement that "its use is only permitted with > >>>>> explicit approval." > >>>> > >>>> I think you're conflating a safety issue with a security issue. > >>>> > >>>> David is saying there can be cases where the kernel is broken, and > >>>> "never" is a production escape hatch to disable the feature until a > >>>> kernel upgrade for the fix is possible. In such a case, it doesn't > >>>> make sense to override this decision based on any sort of workload > >>>> policy, privileged or not. > >>>> > >>>> The way I understand you is that you want enrollment (and/or > >>>> self-management) only for blessed applications. Because you don't > >>>> generally trust workloads in the wild enough to switch the global > >>>> default away from "never", given the semantics of always/madvise. > >>> > >>> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control? > >>> > >>> We'd need > >>> > >>> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work. > >>> > >>> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work? > >> > >> This means process level control overrides VMA level control, which > >> overrides global control, right? > >> > >> Intuitively, it should be that VMA level control overrides process level > >> control, which overrides global control, namely finer granularly control > >> precedes coarse one. But some apps might not use VMA level control > >> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA > >> level control is what we want? > > > > Let's take a step back: > > > > Current behavior is > > > > 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never". > > Just to add here to the current behavior for completeness, if we have the global system setting set to never, > but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have: > > [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled > always madvise [never] > [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > always inherit [madvise] never > > And then MADV_HUGEPAGE some region, it still gives me a THP. > > > > > 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA. > > > > 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it. > > > > > > Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today. > > > > That commit notes: > > > > "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously." > > > > > > Whatever we do, we have to be careful to not create more mess or inconsistency. > > > > Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever. > > > > > I thought I will also summarize what the real world usecases are that we want to solve: > > 1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs, > others will incur a regression (either its a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM). > So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Metas) fleet at this moment.) > > 2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so. > (We hope this is us (meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt-out and be "madvise". > New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.) > > 3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins dont want to switch the global policy to madvise, but are willing to accept certain processes to madvise. > But David mentioned in [2] that never means no thps, no exceptions and the only way to solve some issues in the past has been to disable THPs completely. The interpretation of never as "disable THPs completely" makes sense to me, as sysadmins need a guaranteed way to disable THP. If applications could still allocate THPs in never mode, it would frustrate sysadmins. For case 3), I agree with Johannes that introducing a new mode (e.g., "blessed" or "bpf") would be ideal. This mode would allow per-task THP policy adjustments via BPF programs, defaulting to never when no BPF program is attached. [0]. https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ -- Regards Yafang ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-11 14:08 ` Usama Arif 2025-05-13 11:43 ` Yafang Shao @ 2025-05-13 12:04 ` David Hildenbrand 1 sibling, 0 replies; 29+ messages in thread From: David Hildenbrand @ 2025-05-13 12:04 UTC (permalink / raw) To: Usama Arif, Zi Yan Cc: Johannes Weiner, Yafang Shao, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team >>> This means process level control overrides VMA level control, which >>> overrides global control, right? >>> >>> Intuitively, it should be that VMA level control overrides process level >>> control, which overrides global control, namely finer granularly control >>> precedes coarse one. But some apps might not use VMA level control >>> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA >>> level control is what we want? >> >> Let's take a step back: >> >> Current behavior is >> >> 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never". > > Just to add here to the current behavior for completeness, if we have the global system setting set to never, > but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have: > > [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled > always madvise [never] > [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > always inherit [madvise] never > > And then MADV_HUGEPAGE some region, it still gives me a THP. Yes. These "global" / "system" toggles are to be considered "one setting". If you set a per-size toggle to something that is not "inherit", then you're telling the system that you want more fine-grain control. In retrospective, we could maybe have made "/sys/kernel/mm/transparent_hugepage/enabled" a symlink to /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled, but we didn't think of that option back then and decided for the "inherit" solution. > >> >> 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA. >> >> 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it. >> >> >> Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today. >> >> That commit notes: >> >> "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously." >> >> >> Whatever we do, we have to be careful to not create more mess or inconsistency. >> >> Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever. >> > > > I thought I will also summarize what the real world usecases are that we want to solve: > > 1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs, > others will incur a regression (either its a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM). > So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Metas) fleet at this moment.) Agreed. Similar to the process setting VM_HUGEPAGE on all VMAs where we do want VMAs. > > 2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so. > (We hope this is us (meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt-out and be "madvise". > New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.) > Understood. Similar to the process setting VM_NOHUGEPAGE on all VMAs where we don't want THPs. > 3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins dont want to switch the global policy to madvise, but are willing to accept certain processes to madvise. > But David mentioned in [2] that never means no thps, no exceptions and the only way to solve some issues in the past has been to disable THPs completely. Yes. Similar to setting on all processes where we don't want any THPs PR_SET_THP_DISABLE. > > Please feel free to add to the above list. I thought it would be good to list them out so that the solution can be derived with them in mind. > > In terms of doing this with prctl, I was able to make prototypes for the 2 approaches that have been discussed: > > a) have prctl change how existing and new VMAs have VM_HUGEPAGE set for the process including after fork+exec, as proposed by David. This prototype is available at [3]. This will solve problem 1 discussed above, but I don't think this > approach can be used to solve problems 2 and 3? There isnt a way where we can have a process change VMA setting so that after prctl, all future allocations are on madvise basis and not global policy (even if always). IOW we will need > some change in thp_vma_allowable_orders to have it done on process level basis. Let's assume we change PR_SET_THP_DISABLE to per-VMA handling as well, we could have for the use cases 1) system policy=madvise. Set (new) PR_SET_THP_ENABLE on the process. Afterwards, MADV_NOHUGEPAGE could be used to fine-tune the VMAs. (e.g., explicitly temporarily disable on some areas, for example, required with some uffd scenarios) 2) system policy=always. Set PR_SET_THP_DISABLE, then only enable it for the VMAs using MADV_HUGEPAGE. 3) system policy=madvise/always. Set PR_SET_THP_DISABLE on all processes where we don't want THPs. In case of 3) nothing would stop the process from re-enabling THPs either using the prctl or madvise(). If that's a problem, PR_SET_THP_DISABLE_LOCKED or sth. like that could be used. But again, just a thought on how to work with what we already have, trying not to create an absolute mess. > > b) have prctl override global policy *only* for hugepages that "inherit" global and only if global is "madvise" or "always". This prototype is available at [4]. The way I did this will achieve usecase 1 and 2, but not 3 (It can very easily be modified to get 3, but didn't do it as there maybe still is a discussion on what should be allowed when global=never?). I do prefer this method as I think it might be simpler overall and achieves both usecases. I'm afraid that will create a mess. :/ -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-09 16:46 ` Johannes Weiner 2025-05-09 22:42 ` David Hildenbrand @ 2025-05-11 2:08 ` Yafang Shao 1 sibling, 0 replies; 29+ messages in thread From: Yafang Shao @ 2025-05-11 2:08 UTC (permalink / raw) To: Johannes Weiner Cc: David Hildenbrand, Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On Sat, May 10, 2025 at 12:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, May 09, 2025 at 05:43:10PM +0800, Yafang Shao wrote: > > On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > On 09.05.25 11:24, Yafang Shao wrote: > > > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > >> > > > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote: > > > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > >>>> > > > >>>> > > > >>>> > > > >>>> On 08/05/2025 06:41, Yafang Shao wrote: > > > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> On 07/05/2025 16:57, Zi Yan wrote: > > > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote: > > > >>>>>>> > > > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote: > > > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level. > > > >>>>> > > > >>>>> Thanks > > > >>>>> > > > >>>>>>>>> > > > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Allowing override of global THP policy per process allows workloads > > > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing > > > >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of > > > >>>>>>>>>> workloads to be run/stacked on the same machine. > > > >>>>>>>>>> > > > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations > > > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is > > > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it. > > > >>>>>>>>>> > > > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific > > > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > > > >>>>>>>>>> that this will work even when there are no cgroups present, and my > > > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being > > > >>>>>>>>>> hierarchical which usually means them having a numerical value. > > > >>>>>>>>> > > > >>>>>>>>> Hi Usama, > > > >>>>>>>>> > > > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of > > > >>>>>>>>> processes running in a container (under a cgroup)? > > > >>>>>>>> > > > >>>>>>>> Hi Zi, > > > >>>>>>>> > > > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always > > > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup. > > > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS > > > >>>>>>>> in exec-invoke. > > > >>>>>>>> This is at the start of the process, but you would already know at the start of the process > > > >>>>>>>> whether you want THP=always for it or not. > > > >>>>>>>> > > > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > >>>>>>> > > > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to > > > >>>>>>> pass the THP enablement or disablement info from a systemd config file. > > > >>>>>>> And if you find those processes do not benefit from using THPs, > > > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes. > > > >>>>>>> > > > >>>>>>> Am I getting it? Thanks. > > > >>>>>>> > > > >>>>>> > > > >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do > > > >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set, > > > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when > > > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3]. > > > >>>>>> > > > >>>>>> The systemd changes should be quite simple to do. > > > >>>>>> > > > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 > > > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151 > > > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045 > > > >>>>> > > > >>>>> This solution carries a risk: since prctl() does not require any > > > >>>>> capabilities, the task itself could call it and override your memory > > > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that > > > >>>>> capability is typically enabled by default in containers, leaving them > > > >>>>> still vulnerable. > > > >>>>> > > > >>>>> This approach might work for Kubernetes/container environments, but it > > > >>>>> would require substantial code changes to implement securely. > > > >>>>> > > > >>>> > > > >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE > > > >>>> already exists and the someone could use this to slow the process down. So the > > > >>>> approach this patch takes shouldn't be anymore of a security fix then what is already > > > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE > > > >>>> should be used to restrict this. > > > >>> > > > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP, > > > >>> since it overrides global system settings. Alternatively, > > > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely > > > >>> certain. > > > >> > > > >> Hm, could you verbalize a concrete security concern? > > > >> > > > >> I've never really looked at the global settings as a hard policy, more > > > >> as picking a default for the workloads in the system. It's usually > > > >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long > > > >> existed to give applications the ability to refine the global choice. > > > >> > > > >> The prctl should probably respect `never' for consistency, but beyond > > > >> that I don't really see the concern, or how this would allow something > > > >> that isn't already possible. > > > > > > > > I would interpret the always, madvise, and never options as follows: > > > > - always > > > > The sysadmin strongly recommends using THP. If a user does not > > > > want to use it, they must explicitly disable it. > > I would call this "kernel mode" or "auto mode", where userspace should > *generally* not have to worry about huge pages, but with an option for > declaring the odd exceptional case. > > Both madvise() and unprivileged prctl() currently work, and IMO should > continue to work, for declaring exceptions. > > > > > - madvise > > > > The sysadmin gently encourages the use of THP, but it is only > > > > enabled when explicitly requested by the application. > > And this "user mode" or "manual mode", where applications self-manage > which parts of userspace they want to enroll. > > Both madvise() and unprivileged prctl() should work here as well, > IMO. There is no policy or security difference between them, it's just > about granularity and usability. > > > > > - never > > > > The sysadmin discourages the use of THP, and "its use is only permitted > > > > with explicit approval" . > > This one I don't quite agree with, and IMO conflicts with what David > is saying as well. > > > > "never" so far means "no thps, no exceptions". We've had serious THP > > > issues in the past, where our workaround until we sorted out the issue > > > for affected customers was to force-disable THPs on that system during boot. > > > > Right, that reflects the current behavior. What we aim to enhance is > > by adding the requirement that "its use is only permitted with > > explicit approval." > > I think you're conflating a safety issue with a security issue. I appreciate the corrections. English isn't my first language, so I occasionally don't use words as precisely as I'd like. > > David is saying there can be cases where the kernel is broken, and > "never" is a production escape hatch to disable the feature until a > kernel upgrade for the fix is possible. In such a case, it doesn't > make sense to override this decision based on any sort of workload > policy, privileged or not. > > The way I understand you is that you want enrollment (and/or > self-management) only for blessed applications. Right. > Because you don't > generally trust workloads in the wild enough to switch the global > default away from "never", given the semantics of always/madvise. Historically, we have always set it to 'never.' Due to concerns stemming from past incidents, the sysadmins have been hesitant to switch it to 'madvise.' However, we’ve now discovered that AI services can gain significant performance benefits from it. As a solution, we propose enabling THP exclusively for AI services while maintaining the global setting as 'never.' > > To me this sounds like you'd need a different mode, call it "blessed"; > with a privileged interface to control which applications are allowed > to madvise/prctl-enable. This appears to be a viable solution. -- Regards Yafang ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif 2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif 2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan @ 2025-05-08 11:06 ` David Hildenbrand 2025-05-08 16:35 ` Usama Arif 2 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand @ 2025-05-08 11:06 UTC (permalink / raw) To: Usama Arif, Andrew Morton, linux-mm Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 07.05.25 16:00, Usama Arif wrote: > Allowing override of global THP policy per process allows workloads > that have shown to benefit from hugepages to do so, without regressing > workloads that wouldn't benefit. This will allow such types of > workloads to be run/stacked on the same machine. > > It also helps in rolling out hugepages in hyperscaler configurations > for workloads that benefit from them, where a single THP policy is > likely to be used across the entire fleet, and prctl will help override it. > > An advantage of doing it via prctl vs creating a cgroup specific > option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is > that this will work even when there are no cgroups present, and my > understanding is there is a strong preference of cgroups controls being > hierarchical which usually means them having a numerical value. > > > The output and code of test program is below: > > [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled > [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > [root@vm4 vmuser]# ./a.out > Default THP setting: > THP is not set to 'always'. > PR_SET_THP_ALWAYS = 1 > THP is set to 'always'. > PR_SET_THP_ALWAYS = 0 > THP is not set to 'always'. Some quick feedback: (1) The "always" in PR_SET_THP_ALWAYS does not look future proof. Why wouldn't someone want to force-disable them for a process (-> "never") or set it to some other new mode ("-> defer" that is currently on the list). (2) In your example, is the toggle specific to 2M THP or the global toggle ...? Unclear. And that makes this interface also suboptimal. (3) I'm a bit concerned about interaction with per-VMA settings (the one we already have, and order-specific ones that people were discussing). It's going to be a mess if we have global, per-process, per-vma and then some other policies (ebpf? whatever else?) on top. The low-hanging fruit would be a per-process toggle that only controls the existing per-VMA toggle: for example, with the semantics that (1) All new (applicable) VMAs start with VM_HUGEPAGE (2) All existing (applicable) VMAs that are *not* VM_NOHUGEPAGE become VM_HUGEPAGE. We did something similar with PR_SET_MEMORY_MERGE. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-08 11:06 ` David Hildenbrand @ 2025-05-08 16:35 ` Usama Arif 2025-05-08 17:39 ` David Hildenbrand 0 siblings, 1 reply; 29+ messages in thread From: Usama Arif @ 2025-05-08 16:35 UTC (permalink / raw) To: David Hildenbrand, Andrew Morton, linux-mm Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 08/05/2025 12:06, David Hildenbrand wrote: > On 07.05.25 16:00, Usama Arif wrote: >> Allowing override of global THP policy per process allows workloads >> that have shown to benefit from hugepages to do so, without regressing >> workloads that wouldn't benefit. This will allow such types of >> workloads to be run/stacked on the same machine. >> >> It also helps in rolling out hugepages in hyperscaler configurations >> for workloads that benefit from them, where a single THP policy is >> likely to be used across the entire fleet, and prctl will help override it. >> >> An advantage of doing it via prctl vs creating a cgroup specific >> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is >> that this will work even when there are no cgroups present, and my >> understanding is there is a strong preference of cgroups controls being >> hierarchical which usually means them having a numerical value. >> >> >> The output and code of test program is below: >> >> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled >> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >> [root@vm4 vmuser]# ./a.out >> Default THP setting: >> THP is not set to 'always'. >> PR_SET_THP_ALWAYS = 1 >> THP is set to 'always'. >> PR_SET_THP_ALWAYS = 0 >> THP is not set to 'always'. > > Some quick feedback: > > (1) The "always" in PR_SET_THP_ALWAYS does not look future proof. Why wouldn't someone want to force-disable them for a process (-> "never") or set it to some other new mode ("-> defer" that is currently on the list). Yes agree with this, I think there are 3 different possible ways forward for this which I outlined in [1] in reply to Zi Yan. I like flags2 approach, but let me know what you think. [1] https://lore.kernel.org/all/9ed673ad-764f-4f46-84a7-ef98b19d22ec@gmail.com/ > > (2) In your example, is the toggle specific to 2M THP or the global toggle ...? Unclear. And that makes this interface also suboptimal. In this approach you would overwrite inherit folio sizes, and I think thats the right approach. So if you have for e.g. 2M and 16K set to inherit, and the global one is set to madvise, doing PR_SET_THP_ALWAYS would change those folio to always. > > (3) I'm a bit concerned about interaction with per-VMA settings (the one we already have, and order-specific ones that people were discussing). It's going to be a mess if we have global, per-process, per-vma and then some other policies (ebpf? whatever else?) on top. > > > The low-hanging fruit would be a per-process toggle that only controls the existing per-VMA toggle: for example, with the semantics that > > (1) All new (applicable) VMAs start with VM_HUGEPAGE > (2) All existing (applicable) VMAs that are *not* VM_NOHUGEPAGE become VM_HUGEPAGE. > > > We did something similar with PR_SET_MEMORY_MERGE. > For this you mean the prctl command would do for_each_vma and set VM_HUGEPAGE to implement point 2. For having point 1, I think we will still need extra mm->flags, i.e. MMF_VM_THP_MADVISE/DEFER/ALWAYS/NEVER. I think it would have the same affect as what this patch is trying to do? But would be just more expensive in terms of both code changes and the cost of the actual call as you now have to walk all vmas. On the other hand you wont need the below diff in from v1. I do feel the current approach in the patch is simpler? But if your point 3 is better in terms of code maintainability, happy to make it the change to it in v2. diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2f190c90192d..0587dc4b8e2d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, if (vm_flags & VM_HUGEPAGE) mask |= READ_ONCE(huge_anon_orders_madvise); if (hugepage_global_always() || - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled())) + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) || + test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags)) mask |= READ_ONCE(huge_anon_orders_inherit); orders &= mask; ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-08 16:35 ` Usama Arif @ 2025-05-08 17:39 ` David Hildenbrand 2025-05-08 18:05 ` Usama Arif 0 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand @ 2025-05-08 17:39 UTC (permalink / raw) To: Usama Arif, Andrew Morton, linux-mm Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team >> We did something similar with PR_SET_MEMORY_MERGE. >> > > For this you mean the prctl command would do for_each_vma and set VM_HUGEPAGE to implement point 2. Yes. The expectation would be that the prctl is either a) triggered early when there are not that many VMAs yet or b) inherited during fork+exec such that the target process immediately gets it enabled for all VMAs. > For having point 1, I think we will still need extra mm->flags, i.e. MMF_VM_THP_MADVISE/DEFER/ALWAYS/NEVER. Yes, and it's unclear what we really want here. > > I think it would have the same affect as what this patch is trying to do? But would be just more > expensive in terms of both code changes and the cost of the actual call as you now have to walk > all vmas. On the other hand you wont need the below diff in from v1. I do feel the current approach > in the patch is simpler? But if your point 3 is better in terms of code maintainability, happy to make > it the change to it in v2. Yes. Having a toggle that a) Changes the default of new VMAs to be VM_HUGEPAGE b) Modifies all existing VMAs that are not VM_NOHUGEPAGE to be VM_HUGEPAGE c) Is inherited during fork+exec Would not add any additional checks in our already-complicated THP allocation logic, it simply changes the default/value of VM_HUGEPAGE. At least to me, this sounds like better semantics. Would that make your use case happy? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always 2025-05-08 17:39 ` David Hildenbrand @ 2025-05-08 18:05 ` Usama Arif 0 siblings, 0 replies; 29+ messages in thread From: Usama Arif @ 2025-05-08 18:05 UTC (permalink / raw) To: David Hildenbrand, Andrew Morton, linux-mm Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team On 08/05/2025 18:39, David Hildenbrand wrote: >>> We did something similar with PR_SET_MEMORY_MERGE. >>> >> >> For this you mean the prctl command would do for_each_vma and set VM_HUGEPAGE to implement point 2. > > Yes. The expectation would be that the prctl is either a) triggered early when there are not that many VMAs yet or b) inherited during fork+exec such that the target process immediately gets it enabled for all VMAs. Yes that is the case for us as well, For us its mostly going to be a, i.e. exec_invoke called by systemd. But b will be good as well. > >> For having point 1, I think we will still need extra mm->flags, i.e. MMF_VM_THP_MADVISE/DEFER/ALWAYS/NEVER. > > Yes, and it's unclear what we really want here. Let me make a v2 and we might be able to have a better discussion once its done. I will move it to RFC. > >> >> I think it would have the same affect as what this patch is trying to do? But would be just more >> expensive in terms of both code changes and the cost of the actual call as you now have to walk >> all vmas. On the other hand you wont need the below diff in from v1. I do feel the current approach >> in the patch is simpler? But if your point 3 is better in terms of code maintainability, happy to make >> it the change to it in v2. > > Yes. Having a toggle that > > a) Changes the default of new VMAs to be VM_HUGEPAGE > > b) Modifies all existing VMAs that are not VM_NOHUGEPAGE to be VM_HUGEPAGE > > c) Is inherited during fork+exec > > Would not add any additional checks in our already-complicated THP allocation logic, it simply changes the default/value of VM_HUGEPAGE. > > > At least to me, this sounds like better semantics. Would that make your use case happy? > Yes, this would work for us. Let me try and make this work in the way you described and also make it work for MADVISE and NEVER, will send that in v2. Thanks for the feedback!! ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-05-13 12:05 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif 2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif 2025-05-07 15:02 ` Usama Arif 2025-05-07 20:14 ` Zi Yan 2025-05-08 10:53 ` Usama Arif 2025-05-08 20:29 ` Zi Yan 2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan 2025-05-07 15:12 ` Usama Arif 2025-05-07 15:57 ` Zi Yan 2025-05-07 16:09 ` Usama Arif 2025-05-08 5:41 ` Yafang Shao 2025-05-08 16:04 ` Usama Arif 2025-05-09 2:15 ` Yafang Shao 2025-05-09 5:13 ` Johannes Weiner 2025-05-09 9:24 ` Yafang Shao 2025-05-09 9:30 ` David Hildenbrand 2025-05-09 9:43 ` Yafang Shao 2025-05-09 16:46 ` Johannes Weiner 2025-05-09 22:42 ` David Hildenbrand 2025-05-09 23:34 ` Zi Yan 2025-05-11 8:15 ` David Hildenbrand 2025-05-11 14:08 ` Usama Arif 2025-05-13 11:43 ` Yafang Shao 2025-05-13 12:04 ` David Hildenbrand 2025-05-11 2:08 ` Yafang Shao 2025-05-08 11:06 ` David Hildenbrand 2025-05-08 16:35 ` Usama Arif 2025-05-08 17:39 ` David Hildenbrand 2025-05-08 18:05 ` Usama Arif
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox