* [RFC PATCH] mm: show mthp_fault_alloc and mthp_fault_fallback of multi-size THPs
From: Barry Song @ 2024-03-26 3:01 UTC (permalink / raw)
To: akpm, linux-mm
Cc: ryan.roberts, david, kasong, yuzhao, yosryahmed,
cerasuolodomenico, surenb, Barry Song
From: Barry Song <v-songbaohua@oppo.com>
Profiling a system blindly with mTHP has become challenging due
to the lack of visibility into its operations. While displaying
additional statistics such as partial map/unmap actions may
spark debate, presenting the success rate of mTHP allocations
appears to be a straightforward and pressing need.
Recently, I've been experiencing significant difficulty debugging
performance improvements and regressions without these figures.
It's crucial for us to understand the true effectiveness of
mTHP in real-world scenarios, especially in systems with fragmented
memory.
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
include/linux/vm_event_item.h | 2 ++
mm/memory.c | 2 ++
mm/vmstat.c | 2 ++
3 files changed, 6 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..3233b39bdb38 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -95,6 +95,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_ALLOC,
THP_FAULT_FALLBACK,
THP_FAULT_FALLBACK_CHARGE,
+ MTHP_FAULT_ALLOC,
+ MTHP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
THP_FILE_ALLOC,
diff --git a/mm/memory.c b/mm/memory.c
index 62ee4a15092a..803f00a07d54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4364,12 +4364,14 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
}
folio_throttle_swaprate(folio, gfp);
clear_huge_page(&folio->page, vmf->address, 1 << order);
+ count_vm_event(MTHP_FAULT_ALLOC);
return folio;
}
next:
order = next_order(&orders, order);
}
+ count_vm_event(MTHP_FAULT_FALLBACK);
fallback:
#endif
return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..0cc86c73ecdc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1353,6 +1353,8 @@ const char * const vmstat_text[] = {
"thp_fault_alloc",
"thp_fault_fallback",
"thp_fault_fallback_charge",
+ "mthp_fault_alloc",
+ "mthp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
"thp_file_alloc",
--
2.34.1
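For reference, with the patch applied the two new counters appear in /proc/vmstat
alongside the existing thp_fault_* counters, one "<name> <value>" pair per line.
A minimal userspace sketch (illustrative only, not part of the patch) to pull
them out:

/*
 * Illustrative only: print the mthp_* counters added above by scanning
 * /proc/vmstat, where each line is "<name> <value>".
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[128];
	unsigned long long value;
	FILE *fp = fopen("/proc/vmstat", "r");

	if (!fp) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(fp, "%127s %llu", name, &value) == 2) {
		if (!strncmp(name, "mthp_", 5))
			printf("%s %llu\n", name, value);
	}
	fclose(fp);
	return 0;
}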
* Re: [RFC PATCH] mm: show mthp_fault_alloc and mthp_fault_fallback of multi-size THPs
From: Matthew Wilcox @ 2024-03-26 3:24 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, ryan.roberts, david, kasong, yuzhao, yosryahmed,
cerasuolodomenico, surenb, Barry Song
On Tue, Mar 26, 2024 at 04:01:03PM +1300, Barry Song wrote:
> Profiling a system blindly with mTHP has become challenging due
> to the lack of visibility into its operations. While displaying
> additional statistics such as partial map/unmap actions may
> spark debate, presenting the success rate of mTHP allocations
> appears to be a straightforward and pressing need.
Ummm ... no? Not like this anyway. It has the bad assumption that
"mTHP" only comes in one size.
* Re: [RFC PATCH] mm: show mthp_fault_alloc and mthp_fault_fallback of multi-size THPs
From: Barry Song @ 2024-03-26 3:40 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, ryan.roberts, david, kasong, yuzhao, yosryahmed,
cerasuolodomenico, surenb, Barry Song
On Tue, Mar 26, 2024 at 4:25 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Mar 26, 2024 at 04:01:03PM +1300, Barry Song wrote:
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. While displaying
> > additional statistics such as partial map/unmap actions may
> > spark debate, presenting the success rate of mTHP allocations
> > appears to be a straightforward and pressing need.
>
> Ummm ... no? Not like this anyway. It has the bad assumption that
> "mTHP" only comes in one size.
I had initially considered per-size allocation and fallback before sending
the RFC. However, in order to prompt discussion and exploration
into profiling possibilities, I opted to send the simplest code instead.
We could consider two options for displaying per-size statistics.
1. A single file could be used to display data for all sizes.
1024KiB fault allocation:
1024KiB fault fallback:
512KiB fault allocation:
512KiB fault fallback:
....
64KiB fault allocation:
64KiB fault fallback:
2. A separate file for each size
For example,
/sys/kernel/debug/transparent_hugepage/hugepages-1024kB/vmstat
/sys/kernel/debug/transparent_hugepage/hugepages-512kB/vmstat
...
/sys/kernel/debug/transparent_hugepage/hugepages-64kB/vmstat
While the latter option may seem more appealing, it presents a challenge
in situations where a 512kB allocation may fall back to 256kB, yet a separate
256kB allocation succeeds. Demonstrating that the successful 256kB allocation
is actually a fallback from the 512kB allocation can be complex, especially
if we begin to support per-VMA hints for mTHP sizes.
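To make the per-size idea a bit more concrete, here is a rough, untested sketch
of how per-order counters could be kept on the kernel side; all names are
placeholders rather than a proposed ABI:

/*
 * Hypothetical sketch only: one counter per order and per event, so each
 * mTHP size is accounted separately. Assumes <linux/percpu.h> and
 * PMD_ORDER from the pgtable headers; all names are placeholders.
 */
enum mthp_fault_event {
	MTHP_FAULT_EVENT_ALLOC,
	MTHP_FAULT_EVENT_FALLBACK,
	__MTHP_FAULT_EVENT_NR,
};

struct mthp_fault_stat {
	unsigned long event[PMD_ORDER + 1][__MTHP_FAULT_EVENT_NR];
};

static DEFINE_PER_CPU(struct mthp_fault_stat, mthp_fault_stats);

static inline void count_mthp_fault_event(int order, enum mthp_fault_event e)
{
	if (order < 0 || order > PMD_ORDER)
		return;
	this_cpu_inc(mthp_fault_stats.event[order][e]);
}

alloc_anon_folio() would bump MTHP_FAULT_EVENT_ALLOC with whichever order
actually succeeded, the fallback paths would bump MTHP_FAULT_EVENT_FALLBACK
for the orders that were attempted, and either presentation above could then
be produced by summing the per-CPU values at read time.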
Thanks
Barry
* Re: [RFC PATCH] mm: show mthp_fault_alloc and mthp_fault_fallback of multi-size THPs
From: Barry Song @ 2024-03-26 22:19 UTC (permalink / raw)
To: Matthew Wilcox, david, ryan.roberts, yuzhao
Cc: akpm, linux-mm, kasong, yosryahmed, cerasuolodomenico, surenb,
Barry Song
On Tue, Mar 26, 2024 at 4:40 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 26, 2024 at 4:25 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Mar 26, 2024 at 04:01:03PM +1300, Barry Song wrote:
> > > Profiling a system blindly with mTHP has become challenging due
> > > to the lack of visibility into its operations. While displaying
> > > additional statistics such as partial map/unmap actions may
> > > spark debate, presenting the success rate of mTHP allocations
> > > appears to be a straightforward and pressing need.
> >
> > Ummm ... no? Not like this anyway. It has the bad assumption that
> > "mTHP" only comes in one size.
>
>
> I had initially considered per-size allocation and fallback before sending
> the RFC. However, in order to prompt discussion and exploration
> into profiling possibilities, I opted to send the simplest code instead.
>
> We could consider two options for displaying per-size statistics.
>
> 1. A single file could be used to display data for all sizes.
> 1024KiB fault allocation:
> 1024KiB fault fallback:
> 512KiB fault allocation:
> 512KiB fault fallback:
> ....
> 64KiB fault allocation:
> 64KiB fault fallback:
>
> 2. A separate file for each size
> For example,
>
> /sys/kernel/debug/transparent_hugepage/hugepages-1024kB/vmstat
> /sys/kernel/debug/transparent_hugepage/hugepages-512kB/vmstat
> ...
> /sys/kernel/debug/transparent_hugepage/hugepages-64kB/vmstat
>
Hi Ryan, David, Willy, Yu,
I'm collecting feedback on whether you'd prefer access to something similar
to /sys/kernel/debug/transparent_hugepage/hugepages-<size>/stat to help
determine the direction to take for this patch.
This is important to us because we're keen on understanding how often
folio allocations fail on a system with limited memory, such as a phone.
Presently, I've observed a success rate of under 8% for 64KiB allocations.
Yet, after integrating Yu's TAO optimization [1] and establishing an 800MiB
nomerge zone on a phone with 8GiB of memory, the success rate improves
substantially, reaching approximately 40%. I'm still fine-tuning the
optimal size for the zone.
[1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> While the latter option may seem more appealing, it presents a challenge
> in situations where a 512kB allocation may fall back to 256kB, yet a separate
> 256kB allocation succeeds. Demonstrating that the successful 256kB allocation
> is actually a fallback from the 512kB allocation can be complex, especially
> if we begin to support per-VMA hints for mTHP sizes.
>
Thanks
Barry
* Re: [RFC PATCH] mm: show mthp_fault_alloc and mthp_fault_fallback of multi-size THPs
From: David Hildenbrand @ 2024-03-27 11:35 UTC (permalink / raw)
To: Barry Song, Matthew Wilcox, ryan.roberts, yuzhao
Cc: akpm, linux-mm, kasong, yosryahmed, cerasuolodomenico, surenb,
Barry Song, Johannes Weiner
On 26.03.24 23:19, Barry Song wrote:
> On Tue, Mar 26, 2024 at 4:40 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Mar 26, 2024 at 4:25 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Tue, Mar 26, 2024 at 04:01:03PM +1300, Barry Song wrote:
>>>> Profiling a system blindly with mTHP has become challenging due
>>>> to the lack of visibility into its operations. While displaying
>>>> additional statistics such as partial map/unmap actions may
>>>> spark debate, presenting the success rate of mTHP allocations
>>>> appears to be a straightforward and pressing need.
>>>
>>> Ummm ... no? Not like this anyway. It has the bad assumption that
>>> "mTHP" only comes in one size.
>>
>>
>> I had initially considered per-size allocation and fallback before sending
>> the RFC. However, in order to prompt discussion and exploration
>> into profiling possibilities, I opted to send the simplest code instead.
>>
>> We could consider two options for displaying per-size statistics.
>>
>> 1. A single file could be used to display data for all sizes.
>> 1024KiB fault allocation:
>> 1024KiB fault fallback:
>> 512KiB fault allocation:
>> 512KiB fault fallback:
>> ....
>> 64KiB fault allocation:
>> 64KiB fault fallback:
>>
>> 2. A separate file for each size
>> For example,
>>
>> /sys/kernel/debug/transparent_hugepage/hugepages-1024kB/vmstat
>> /sys/kernel/debug/transparent_hugepage/hugepages-512kB/vmstat
>> ...
>> /sys/kernel/debug/transparent_hugepage/hugepages-64kB/vmstat
>>
>
> Hi Ryan, David, Willy, Yu,
Hi!
>
> I'm collecting feedback on whether you'd prefer access to something similar
> to /sys/kernel/debug/transparent_hugepage/hugepages-<size>/stat to help
> determine the direction to take for this patch.
I discussed in the past that we might want to place statistics into
sysfs. The idea was to place them into our new hierarchy:
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/...
following the "one value per file" sysfs design principle.
We could have a new folder "stats" in there that contains files with
statistics we care about.
Of course, we could also place that initially into debugfs in a similar
fashion, and move it over once the interface is considered good and stable.
My 2 cents would be to avoid a "single file".
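Just to illustrate the shape of that layout, a rough sketch of a per-size
"stats" attribute group; everything here is hypothetical, and
order_from_size_kobj() / sum_mthp_fault_alloc() stand in for whatever helpers
a real implementation would provide:

/*
 * Hypothetical sketch: one read-only file per counter, grouped in a
 * "stats" subdirectory under each hugepages-<size>kB kobject.
 */
static ssize_t anon_fault_alloc_show(struct kobject *kobj,
				     struct kobj_attribute *attr, char *buf)
{
	int order = order_from_size_kobj(kobj);		/* placeholder */

	return sysfs_emit(buf, "%lu\n", sum_mthp_fault_alloc(order));
}
static struct kobj_attribute anon_fault_alloc_attr =
	__ATTR_RO(anon_fault_alloc);

static struct attribute *mthp_stats_attrs[] = {
	&anon_fault_alloc_attr.attr,
	/* ... one attribute per counter, e.g. anon_fault_fallback ... */
	NULL,
};

static const struct attribute_group mthp_stats_attr_group = {
	.name	= "stats",	/* creates the stats/ subdirectory */
	.attrs	= mthp_stats_attrs,
};

/* to be called wherever each hugepages-<size>kB kobject is registered */
static int mthp_register_stats(struct kobject *size_kobj)
{
	return sysfs_create_group(size_kobj, &mthp_stats_attr_group);
}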
>
> This is important to us because we're keen on understanding how often
> folio allocations fail on a system with limited memory, such as a phone.
>
> Presently, I've observed a success rate of under 8% for 64KiB allocations.
> Yet, after integrating Yu's TAO optimization [1] and establishing an 800MiB
> nomerge zone on a phone with 8GiB of memory, the success rate improves
> substantially, reaching approximately 40%. I'm still fine-tuning the
> optimal size for the zone.
Just as a side note:
I didn't have the capacity to comment in detail on the "new zones"
proposal in-depth so far (I'm hoping / assume there will be discussions
at LSF/MM), but I'm hoping we can avoid that for now and instead improve
our pageblock infrastructure, like Johannes is trying to, to achieve
similar gains.
I suspect "some things we can do with new zones we can also do with
pageblocks inside a zone". For example, there were discussions in the
past to have "sticky movable" pageblocks: pageblocks that may only
contain movable data. One could do the same with "pageblocks may not
contain allocations < order X" etc. So one could similarly optimize the
memmap to some degree for these pageblocks.
IMHO we should first try making THP <= pageblock allocations more
reliable, not using new zones, and I'm happy that Johannes et al. are
doing work in that direction. But it's a longer discussion to be had at
LSF/MM.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH] mm: show mthp_fault_alloc and mthp_fault_fallback of multi-size THPs
From: Ryan Roberts @ 2024-03-27 12:17 UTC (permalink / raw)
To: David Hildenbrand, Barry Song, Matthew Wilcox, yuzhao
Cc: akpm, linux-mm, kasong, yosryahmed, cerasuolodomenico, surenb,
Barry Song, Johannes Weiner
On 27/03/2024 11:35, David Hildenbrand wrote:
> On 26.03.24 23:19, Barry Song wrote:
>> On Tue, Mar 26, 2024 at 4:40 PM Barry Song <21cnbao@gmail.com> wrote:
>>>
>>> On Tue, Mar 26, 2024 at 4:25 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Tue, Mar 26, 2024 at 04:01:03PM +1300, Barry Song wrote:
>>>>> Profiling a system blindly with mTHP has become challenging due
>>>>> to the lack of visibility into its operations. While displaying
>>>>> additional statistics such as partial map/unmap actions may
>>>>> spark debate, presenting the success rate of mTHP allocations
>>>>> appears to be a straightforward and pressing need.
>>>>
>>>> Ummm ... no? Not like this anyway. It has the bad assumption that
>>>> "mTHP" only comes in one size.
>>>
>>>
>>> I had initially considered per-size allocation and fallback before sending
>>> the RFC. However, in order to prompt discussion and exploration
>>> into profiling possibilities, I opted to send the simplest code instead.
>>>
>>> We could consider two options for displaying per-size statistics.
>>>
>>> 1. A single file could be used to display data for all sizes.
>>> 1024KiB fault allocation:
>>> 1024KiB fault fallback:
>>> 512KiB fault allocation:
>>> 512KiB fault fallback:
>>> ....
>>> 64KiB fault allocation:
>>> 64KiB fault fallback:
>>>
>>> 2. A separate file for each size
>>> For example,
>>>
>>> /sys/kernel/debug/transparent_hugepage/hugepages-1024kB/vmstat
>>> /sys/kernel/debug/transparent_hugepage/hugepages-512kB/vmstat
>>> ...
>>> /sys/kernel/debug/transparent_hugepage/hugepages-64kB/vmstat
>>>
>>
>> Hi Ryan, David, Willy, Yu,
>
> Hi!
>
>>
>> I'm collecting feedback on whether you'd prefer access to something similar
>> to /sys/kernel/debug/transparent_hugepage/hugepages-<size>/stat to help
>> determine the direction to take for this patch.
>
> I discussed in the past that we might want to place statistics into sysfs. The
> idea was to place them into our new hierarchy:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/...
>
> following the "one value per file" sysfs design principle.
>
> We could have a new folder "stats" in there that contains files with statistics
> we care about.
>
> Of course, we could also place that initially into debugfs in a similar fashion,
> and move it over once the interface is considered good and stable.
>
> My 2 cents would be to avoid a "single file".
Yes, I agree with this. We discussed this in the past and I summarised the outcome
here: https://lore.kernel.org/linux-mm/6cc7d781-884f-4d8f-a175-8609732b87eb@arm.com/
There are more counters on my list than what you are proposing here, but the
conclusion was that they should be per-size and exposed through sysfs, as David
suggests.
Personally I think those counters are obviously needed, so I'd prefer to go
straight to sysfs if we can :)
I had a low-priority todo item to look at this - very pleased that you're taking
it on!
Copy/pasting my original summary here, for the lazy:
--8<----
I just want to try to summarise the counters we have discussed in this thread to
check my understanding:
1. global mTHP successful allocation counter, per mTHP size (inc only)
2. global mTHP failed allocation counter, per mTHP size (inc only)
3. global mTHP currently allocated counter, per mTHP size (inc and dec)
4. global "mTHP became partially mapped 1 or more processes" counter (inc only)
I guess the above should apply to both page cache and anon? Do we want separate
counters for each?
I'm not sure if we would want 4. to be per mTHP size or a single counter for all?
Probably the former, if it provides a bit more info for negligible cost.
Where should these be exposed? I guess /proc/vmstat is the obvious place, but I
don't think there is any precedent for per-size counters (especially where the
sizes will change depending on the system). Perhaps it would be better to expose
them in their per-size directories in
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB ?
In addition to the above global counters, there is a case for adding a per-process
version of 4. to smaps.
--8<----
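If they do end up as one-value-per-file entries under the per-size directories,
consuming them from userspace stays trivial. A minimal sketch; the path below is
an assumption about where the files might land, not an existing ABI:

/* Illustrative only; the sysfs path is an assumption, not a merged ABI. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
	const char *pat =
		"/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/*";
	glob_t g;
	size_t i;

	if (glob(pat, 0, NULL, &g))
		return 1;
	for (i = 0; i < g.gl_pathc; i++) {
		unsigned long long val;
		FILE *fp = fopen(g.gl_pathv[i], "r");

		if (fp && fscanf(fp, "%llu", &val) == 1)
			printf("%s: %llu\n", g.gl_pathv[i], val);
		if (fp)
			fclose(fp);
	}
	globfree(&g);
	return 0;
}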
Thanks,
Ryan
>
>>
>> This is important to us because we're keen on understanding how often
>> folio allocations fail on a system with limited memory, such as a phone.
>>
>> Presently, I've observed a success rate of under 8% for 64KiB allocations.
>> Yet, after integrating Yu's TAO optimization [1] and establishing an 800MiB
>> nomerge zone on a phone with 8GiB of memory, the success rate improves
>> substantially, reaching approximately 40%. I'm still fine-tuning the
>> optimal size for the zone.
>
> Just as a side note:
>
> I didn't have the capacity to comment in detail on the "new zones" proposal
> in-depth so far (I'm hoping / assume there will be discussions at LSF/MM), but
> I'm hoping we can avoid that for now and instead improve our pageblock
> infrastructure, like Johannes is trying to, to achieve similar gains.
>
> I suspect "some things we can do with new zones we can also do with pageblocks
> inside a zone". For example, there were discussions in the past to have "sticky
> movable" pageblocks: pageblocks that may only contain movable data. One could do
> the same with "pageblocks may not contain allocations < order X" etc. So one
> could similarly optimize the memmap to some degree for these pageblocks.
>
> IMHO we should first try making THP <= pageblock allocations more reliable, not
> using new zones, and I'm happy that Johannes et al. are doing work in that
> direction. But it's a longer discussion to be had at LSF/MM.
>