linux-mm.kvack.org archive mirror
* [RFC PATCH] mm: control mthp per process/cgroup
@ 2024-08-16  9:13 Nanyong Sun
  2024-08-16 18:15 ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Nanyong Sun @ 2024-08-16  9:13 UTC (permalink / raw)
  To: hughd, akpm, david, willy, ryan.roberts
  Cc: baohua, baolin.wang, ioworker0, peterx, ziy, sunnanyong,
	wangkefeng.wang, linux-mm, linux-kernel

Now the large folio control interfaces are system wide and tend to
default to on: file systems use large folios by default if supported,
and mTHP tends to be enabled by default at boot [1].
When large folios are enabled, some workloads see a performance benefit,
but some may not, and side effects can occur: memory usage may increase,
and direct reclaim may run more frequently because of more large-order
allocations, which in turn raises CPU usage. We observed this in a
production environment running nginx: the pgscan_direct count increased
significantly compared to before, reaching up to 3000 times per second,
and disabling file large folios fixed it.

Since the anon/shmem/file mTHP control interfaces are system wide, a
per-process or per-cgroup control API may be necessary; for example, on
one machine some containers may use large folios while others disable them.
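
For reference, the existing system-wide per-size controls are exposed via
sysfs (paths as in the kernel's Transparent Hugepage admin guide; the 64kB
size is just an example and depends on the architecture's supported orders):

```shell
# Inspect the per-size anon mTHP policy (always/inherit/madvise/never).
cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

# Turn 64kB mTHP off -- but note this is system wide, affecting every
# process on the machine, which is exactly the limitation this RFC targets.
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
```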

This patch presents a possible solution:
Extend the existing prctl API (PR_SET_THP_DISABLE) and use the third
argument to specify which kind of mTHP to disable for the process.
For example:
  prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0); //keep the original semantics
  prctl(PR_SET_THP_DISABLE, 1, PR_DISABLE_ANON_MTHP, 0, 0);
  prctl(PR_SET_THP_DISABLE, 1, PR_DISABLE_SHMEM_MTHP, 0, 0);
  prctl(PR_SET_THP_DISABLE, 1, PR_DISABLE_FILE_MTHP, 0, 0);

Child processes inherit the setting, so if a seed process has configured
it, all processes in its cgroup will inherit the setting.

This patch does not implement control over file mTHP; that is planned
for after the pagecache folio size control work is done [2].

[1] https://lore.kernel.org/linux-mm/20240814085409.124466-1-21cnbao@gmail.com/T/
[2] https://lore.kernel.org/lkml/20240717071257.4141363-1-ryan.roberts@arm.com/T/

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
This patch is mainly intended to discuss possible directions.

I am not sure whether occupying 3 bits in the mm flags hole is a good
approach; do we need to add a new field in mm_struct for this instead?

Another possible solution (not in this patch) is to go in the opposite
direction, i.e. only allow the processes/cgroups that really need large
folios to use mTHP. We could add a new prctl API to mark a process as
allowed to use some specific sizes of mTHP.

TODO:
Update the reference manual for the change to prctl PR_SET_THP_DISABLE.

 include/linux/huge_mm.h        |  3 +++
 include/linux/sched/coredump.h |  9 +++++++--
 include/uapi/linux/prctl.h     |  5 +++++
 kernel/sys.c                   | 36 ++++++++++++++++++++++++++++------
 mm/shmem.c                     |  7 +++++--
 5 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..8c0b62b732b7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -247,6 +247,9 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
 
+		if (test_bit(MMF_DISABLE_ANON_MTHP, &vma->vm_mm->flags))
+			return 0;
+
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
 		if (hugepage_global_always() ||
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index e62ff805cfc9..0935b4790e6f 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -56,6 +56,10 @@ static inline int get_dumpable(struct mm_struct *mm)
 # define MMF_DUMP_MASK_DEFAULT_ELF	0
 #endif
 					/* leave room for more dump flags */
+#define MMF_DISABLE_ANON_MTHP	13
+#define MMF_DISABLE_SHMEM_MTHP	14
+#define MMF_DISABLE_FILE_MTHP	15
+#define MMF_DISABLE_MTHP_MASK	(7 << MMF_DISABLE_ANON_MTHP)
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for
 					   khugepaged */
@@ -96,8 +100,9 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_TOPDOWN_MASK	(1 << MMF_TOPDOWN)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
-				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_DISABLE_MTHP_MASK |\
+				 MMF_HAS_MDWE_MASK | MMF_VM_MERGE_ANY_MASK |\
+				 MMF_TOPDOWN_MASK)
 
 static inline unsigned long mmf_init_flags(unsigned long flags)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 35791791a879..584ac45f4ec8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -178,6 +178,11 @@ struct prctl_mm_map {
 #define PR_GET_TID_ADDRESS	40
 
 #define PR_SET_THP_DISABLE	41
+# define PR_DISABLE_ANON_MTHP	(1UL << 1)
+# define PR_DISABLE_SHMEM_MTHP	(1UL << 2)
+# define PR_DISABLE_FILE_MTHP	(1UL << 3)
+# define DISABLE_MTHP_ALL_MASK	(PR_DISABLE_ANON_MTHP | PR_DISABLE_SHMEM_MTHP |\
+				 PR_DISABLE_FILE_MTHP)
 #define PR_GET_THP_DISABLE	42
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..06f2b1de46a7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2627,17 +2627,41 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_THP_DISABLE:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
+		if (test_bit(MMF_DISABLE_THP, &me->mm->flags))
+			error = 1;
+		if (test_bit(MMF_DISABLE_ANON_MTHP, &me->mm->flags))
+			error |= PR_DISABLE_ANON_MTHP;
+		if (test_bit(MMF_DISABLE_SHMEM_MTHP, &me->mm->flags))
+			error |= PR_DISABLE_SHMEM_MTHP;
+		if (test_bit(MMF_DISABLE_FILE_MTHP, &me->mm->flags))
+			error |= PR_DISABLE_FILE_MTHP;
 		break;
 	case PR_SET_THP_DISABLE:
-		if (arg3 || arg4 || arg5)
+		if (arg4 || arg5)
+			return -EINVAL;
+		if (arg3 && (arg3 & ~DISABLE_MTHP_ALL_MASK))
 			return -EINVAL;
 		if (mmap_write_lock_killable(me->mm))
 			return -EINTR;
-		if (arg2)
-			set_bit(MMF_DISABLE_THP, &me->mm->flags);
-		else
-			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
+		if (arg2) {
+			if (!arg3)
+				set_bit(MMF_DISABLE_THP, &me->mm->flags);
+			if (arg3 & PR_DISABLE_ANON_MTHP)
+				set_bit(MMF_DISABLE_ANON_MTHP, &me->mm->flags);
+			if (arg3 & PR_DISABLE_SHMEM_MTHP)
+				set_bit(MMF_DISABLE_SHMEM_MTHP, &me->mm->flags);
+			if (arg3 & PR_DISABLE_FILE_MTHP)
+				set_bit(MMF_DISABLE_FILE_MTHP, &me->mm->flags);
+		} else {
+			if (!arg3)
+				clear_bit(MMF_DISABLE_THP, &me->mm->flags);
+			if (arg3 & PR_DISABLE_ANON_MTHP)
+				clear_bit(MMF_DISABLE_ANON_MTHP, &me->mm->flags);
+			if (arg3 & PR_DISABLE_SHMEM_MTHP)
+				clear_bit(MMF_DISABLE_SHMEM_MTHP, &me->mm->flags);
+			if (arg3 & PR_DISABLE_FILE_MTHP)
+				clear_bit(MMF_DISABLE_FILE_MTHP, &me->mm->flags);
+		}
 		mmap_write_unlock(me->mm);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
diff --git a/mm/shmem.c b/mm/shmem.c
index 5a77acf6ac6a..f4272883df77 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -556,7 +556,9 @@ static bool __shmem_is_huge(struct inode *inode, pgoff_t index,
 
 	if (!S_ISREG(inode->i_mode))
 		return false;
-	if (mm && ((vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &mm->flags)))
+	if (mm && ((vm_flags & VM_NOHUGEPAGE) ||
+				test_bit(MMF_DISABLE_THP, &mm->flags) ||
+				test_bit(MMF_DISABLE_SHMEM_MTHP, &mm->flags)))
 		return false;
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return false;
@@ -1633,7 +1635,8 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	int order;
 
 	if ((vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
+	    test_bit(MMF_DISABLE_SHMEM_MTHP, &vma->vm_mm->flags))
 		return 0;
 
 	/* If the hardware/firmware marked hugepage support disabled. */
-- 
2.33.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] mm: control mthp per process/cgroup
  2024-08-16  9:13 [RFC PATCH] mm: control mthp per process/cgroup Nanyong Sun
@ 2024-08-16 18:15 ` Matthew Wilcox
  2024-08-19  5:58   ` Nanyong Sun
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2024-08-16 18:15 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: hughd, akpm, david, ryan.roberts, baohua, baolin.wang, ioworker0,
	peterx, ziy, wangkefeng.wang, linux-mm, linux-kernel

On Fri, Aug 16, 2024 at 05:13:27PM +0800, Nanyong Sun wrote:
> Now the large folio control interfaces is system wide and tend to be
> default on: file systems use large folio by default if supported,
> mTHP is tend to default enable when boot [1].
> When large folio enabled, some workloads have performance benefit,
> but some may not and some side effects can happen: the memory usage
> may increase, direct reclaim maybe more frequently because of more
> large order allocations, result in cpu usage also increases. We observed
> this on a product environment which run nginx, the pgscan_direct count
> increased a lot than before, can reach to 3000 times per second, and
> disable file large folio can fix this.

Can you share any details of your nginx workload that shows a regression?
The heuristics for allocating large folios are completely untuned, so
having data for a workload which performs better with small folios is
very valuable.



* Re: [RFC PATCH] mm: control mthp per process/cgroup
  2024-08-16 18:15 ` Matthew Wilcox
@ 2024-08-19  5:58   ` Nanyong Sun
  2024-08-26  2:26     ` Nanyong Sun
  2024-09-02  9:36     ` Baolin Wang
  0 siblings, 2 replies; 6+ messages in thread
From: Nanyong Sun @ 2024-08-19  5:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hughd, akpm, david, ryan.roberts, baohua, baolin.wang, ioworker0,
	peterx, ziy, wangkefeng.wang, linux-mm, linux-kernel


On 2024/8/17 2:15, Matthew Wilcox wrote:

> On Fri, Aug 16, 2024 at 05:13:27PM +0800, Nanyong Sun wrote:
>> Now the large folio control interfaces is system wide and tend to be
>> default on: file systems use large folio by default if supported,
>> mTHP is tend to default enable when boot [1].
>> When large folio enabled, some workloads have performance benefit,
>> but some may not and some side effects can happen: the memory usage
>> may increase, direct reclaim maybe more frequently because of more
>> large order allocations, result in cpu usage also increases. We observed
>> this on a product environment which run nginx, the pgscan_direct count
>> increased a lot than before, can reach to 3000 times per second, and
>> disable file large folio can fix this.
> Can you share any details of your nginx workload that shows a regression?
> The heuristics for allocating large folios are completely untuned, so
> having data for a workload which performs better with small folios is
> very valuable.
>
> .
The RPS (requests per second), which is the performance metric of the
nginx workload, has no regression (and also no improvement); we just
observed that the pgscan_direct rate is much higher with large folios.
So far we have run benchmarks for several workloads: some showed no
performance improvement, but none regressed either.
In a production environment, different workloads may be deployed on one
machine. Therefore, do we need to add a process/cgroup-level control to
prevent workloads that will not see a performance improvement from using
mTHP? In this way, the memory overhead and direct reclaim caused by mTHP
can be avoided for those processes/cgroups.
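
(For what it's worth, the pgscan_direct rate mentioned above can be
sampled from /proc/vmstat with a trivial script like this; a sketch
assuming the standard vmstat counter name:)

```shell
# Read pgscan_direct twice, one second apart, and print the per-second delta.
prev=$(awk '/^pgscan_direct /{print $2}' /proc/vmstat)
sleep 1
cur=$(awk '/^pgscan_direct /{print $2}' /proc/vmstat)
echo "pgscan_direct/s: $((cur - prev))"
```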



* Re: [RFC PATCH] mm: control mthp per process/cgroup
  2024-08-19  5:58   ` Nanyong Sun
@ 2024-08-26  2:26     ` Nanyong Sun
  2024-09-02  9:36     ` Baolin Wang
  1 sibling, 0 replies; 6+ messages in thread
From: Nanyong Sun @ 2024-08-26  2:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hughd, akpm, david, ryan.roberts, baohua, baolin.wang, ioworker0,
	peterx, ziy, wangkefeng.wang, linux-mm, linux-kernel


On 2024/8/19 13:58, Nanyong Sun wrote:
> On 2024/8/17 2:15, Matthew Wilcox wrote:
>> On Fri, Aug 16, 2024 at 05:13:27PM +0800, Nanyong Sun wrote:
>>> Now the large folio control interfaces is system wide and tend to be
>>> default on: file systems use large folio by default if supported,
>>> mTHP is tend to default enable when boot [1].
>>> When large folio enabled, some workloads have performance benefit,
>>> but some may not and some side effects can happen: the memory usage
>>> may increase, direct reclaim maybe more frequently because of more
>>> large order allocations, result in cpu usage also increases. We observed
>>> this on a product environment which run nginx, the pgscan_direct count
>>> increased a lot than before, can reach to 3000 times per second, and
>>> disable file large folio can fix this.
>> Can you share any details of your nginx workload that shows a regression?
>> The heuristics for allocating large folios are completely untuned, so
>> having data for a workload which performs better with small folios is
>> very valuable.
>>
>> .
> The RPS (requests per second) which is the performance metric of 
> nginx workload has no
> regression(also no improvement),we just observed that pgscan_direct 
> rate is much higher
> with large folio.
> So far, we have tested some workloads' benchmark, some did not have 
> performance improvement
> but also did not have regression.
> In a production environment, different workloads may be deployed on a 
> machine. Therefore,
> do we need to add a process/cgroup level control to prevent workloads 
> that will not have
> performance improvement from using mTHP? In this way, the memory 
> overhead and direct reclaim
> caused by mTHP can be avoided for those process/cgroup.
Sorry to disturb, just a friendly ping : )



* Re: [RFC PATCH] mm: control mthp per process/cgroup
  2024-08-19  5:58   ` Nanyong Sun
  2024-08-26  2:26     ` Nanyong Sun
@ 2024-09-02  9:36     ` Baolin Wang
  2024-09-02 13:33       ` David Hildenbrand
  1 sibling, 1 reply; 6+ messages in thread
From: Baolin Wang @ 2024-09-02  9:36 UTC (permalink / raw)
  To: Nanyong Sun, Matthew Wilcox
  Cc: hughd, akpm, david, ryan.roberts, baohua, ioworker0, peterx, ziy,
	wangkefeng.wang, linux-mm, linux-kernel



On 2024/8/19 13:58, Nanyong Sun wrote:
> On 2024/8/17 2:15, Matthew Wilcox wrote:
> 
>> On Fri, Aug 16, 2024 at 05:13:27PM +0800, Nanyong Sun wrote:
>>> Now the large folio control interfaces is system wide and tend to be
>>> default on: file systems use large folio by default if supported,
>>> mTHP is tend to default enable when boot [1].
>>> When large folio enabled, some workloads have performance benefit,
>>> but some may not and some side effects can happen: the memory usage
>>> may increase, direct reclaim maybe more frequently because of more
>>> large order allocations, result in cpu usage also increases. We observed
>>> this on a product environment which run nginx, the pgscan_direct count
>>> increased a lot than before, can reach to 3000 times per second, and
>>> disable file large folio can fix this.
>> Can you share any details of your nginx workload that shows a regression?
>> The heuristics for allocating large folios are completely untuned, so
>> having data for a workload which performs better with small folios is
>> very valuable.
>>
>> .
> The RPS (requests per second) which is the performance metric of nginx 
> workload has no
> regression(also no improvement),we just observed that  pgscan_direct 
> rate is much higher
> with large folio.
> So far, we have tested some workloads' benchmark, some did not have 
> performance improvement
> but also did not have regression.
> In a production environment, different workloads may be deployed on a 
> machine. Therefore,
> do we need to add a process/cgroup level control to prevent workloads 
> that will not have
> performance improvement from using mTHP? In this way, the memory 
> overhead and direct reclaim
> caused by mTHP can be avoided for those process/cgroup.

OK. So there is no regression with mTHP; this seems to be mostly
theoretical analysis so far.

IMHO, it would be better to evaluate your 'per-cgroup mTHP control' idea
on some real workloads and gather some data for evaluation, which would
be more convincing.

Just my 2 cents:)



* Re: [RFC PATCH] mm: control mthp per process/cgroup
  2024-09-02  9:36     ` Baolin Wang
@ 2024-09-02 13:33       ` David Hildenbrand
  0 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand @ 2024-09-02 13:33 UTC (permalink / raw)
  To: Baolin Wang, Nanyong Sun, Matthew Wilcox
  Cc: hughd, akpm, ryan.roberts, baohua, ioworker0, peterx, ziy,
	wangkefeng.wang, linux-mm, linux-kernel

On 02.09.24 11:36, Baolin Wang wrote:
> 
> 
> On 2024/8/19 13:58, Nanyong Sun wrote:
>> On 2024/8/17 2:15, Matthew Wilcox wrote:
>>
>>> On Fri, Aug 16, 2024 at 05:13:27PM +0800, Nanyong Sun wrote:
>>>> Now the large folio control interfaces is system wide and tend to be
>>>> default on: file systems use large folio by default if supported,
>>>> mTHP is tend to default enable when boot [1].
>>>> When large folio enabled, some workloads have performance benefit,
>>>> but some may not and some side effects can happen: the memory usage
>>>> may increase, direct reclaim maybe more frequently because of more
>>>> large order allocations, result in cpu usage also increases. We observed
>>>> this on a product environment which run nginx, the pgscan_direct count
>>>> increased a lot than before, can reach to 3000 times per second, and
>>>> disable file large folio can fix this.
>>> Can you share any details of your nginx workload that shows a regression?
>>> The heuristics for allocating large folios are completely untuned, so
>>> having data for a workload which performs better with small folios is
>>> very valuable.
>>>
>>> .
>> The RPS (requests per second) which is the performance metric of nginx
>> workload has no
>> regression(also no improvement),we just observed that  pgscan_direct
>> rate is much higher
>> with large folio.
>> So far, we have tested some workloads' benchmark, some did not have
>> performance improvement
>> but also did not have regression.
>> In a production environment, different workloads may be deployed on a
>> machine. Therefore,
>> do we need to add a process/cgroup level control to prevent workloads
>> that will not have
>> performance improvement from using mTHP? In this way, the memory
>> overhead and direct reclaim
>> caused by mTHP can be avoided for those process/cgroup.
> 
> OK. So no regression with mTHP, seems just some theoretical analysis.
> 
> IMHO, it would be better to evaluate your 'per-cgroup mTHP control' idea
> on some real workloads, and gather some data to evaluation, which can be
> more convincing.

Agreed!

-- 
Cheers,

David / dhildenb




end of thread, other threads:[~2024-09-02 13:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-16  9:13 [RFC PATCH] mm: control mthp per process/cgroup Nanyong Sun
2024-08-16 18:15 ` Matthew Wilcox
2024-08-19  5:58   ` Nanyong Sun
2024-08-26  2:26     ` Nanyong Sun
2024-09-02  9:36     ` Baolin Wang
2024-09-02 13:33       ` David Hildenbrand
