* [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 @ 2024-03-28 16:47 Yang Shi 2024-04-01 18:16 ` Jonathan Cameron 0 siblings, 1 reply; 16+ messages in thread From: Yang Shi @ 2024-03-28 16:47 UTC (permalink / raw) To: lsf-pc, olivier.singla, Christoph Lameter (Ampere) Cc: Linux MM, Michal Hocko, Dan Williams We just made some progress regarding multi-sized THP benchmarking on the ARM64 platform. So I'd like to propose the topic "Multi-sized THP performance benchmark on ARM64" for the MM track. We ran a series of benchmarks on an Ampere Altra platform with some popular workloads in the cloud: in-memory databases, kernel compilation, etc., using different huge page sizes: 2M, 128K, 64K and others. This topic will cover: - The benchmark data of some popular workloads in the cloud - The performance analysis (where the gain came from, the contributing factors, etc.) - The recommended page sizes which can achieve a decent overall performance gain ^ permalink raw reply [flat|nested] 16+ messages in thread
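For context on how the individual sizes mentioned above (2M, 128K, 64K) are selected at run time: recent kernels expose a per-size knob under /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled. The snippet below is only a minimal sketch of how a benchmark run might enable one size and disable the others; it assumes the per-size sysfs layout introduced in v6.8, 4K base pages and root privileges, and the exact set of available sizes depends on the kernel configuration.

/*
 * Minimal sketch: enable only the 64K mTHP size for a benchmark run by
 * writing the per-size sysfs knobs. The set of hugepages-<size>kB
 * directories and their default values depend on kernel version and base
 * page size, so treat the paths below as assumptions, not a stable API.
 */
#include <stdio.h>
#include <string.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Valid values for each knob: always, inherit, madvise, never. */
	const char *sizes[] = { "2048kB", "128kB", "64kB" };
	char path[128];

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/hugepages-%s/enabled",
			 sizes[i]);
		write_knob(path, strcmp(sizes[i], "64kB") == 0 ? "always" : "never");
	}
	return 0;
}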
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-03-28 16:47 [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 Yang Shi @ 2024-04-01 18:16 ` Jonathan Cameron 2024-04-02 20:04 ` Yang Shi 2024-04-04 18:57 ` Christoph Lameter (Ampere) 0 siblings, 2 replies; 16+ messages in thread From: Jonathan Cameron @ 2024-04-01 18:16 UTC (permalink / raw) To: Yang Shi Cc: lsf-pc, olivier.singla, Christoph Lameter (Ampere), Linux MM, Michal Hocko, Dan Williams On Thu, 28 Mar 2024 09:47:04 -0700 Yang Shi <shy828301@gmail.com> wrote: > We just made some progress regarding multi-sized THP benchmarking on > the ARM64 platform. So I'd like to propose the topic "Multi-sized THP > performance benchmark on ARM64" for MM track. > > We ran a series of benchmarks on Ampere Altra platform using some > popular workloads in the cloud: In-memory databases, kernel > compilation, etc, using different sized huge pages: 2M, 128K, 64K and > others. > > This topic will cover: > - The benchmark data of some popular workloads in Cloud > - The performance analysis (where the gain came from, the > contributing factors, etc) > - Th recommended page sizes which can achieve overall decent performance gain > Sounds like useful data, but is it a suitable topic for LSF-MM? What open questions etc is it raising? I'm very interested in seeing your results, but maybe this isn't the best path for them. Jonathan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-01 18:16 ` Jonathan Cameron @ 2024-04-02 20:04 ` Yang Shi 2024-04-04 18:57 ` Christoph Lameter (Ampere) 1 sibling, 0 replies; 16+ messages in thread From: Yang Shi @ 2024-04-02 20:04 UTC (permalink / raw) To: Jonathan Cameron Cc: lsf-pc, olivier.singla, Christoph Lameter (Ampere), Linux MM, Michal Hocko, Dan Williams On Mon, Apr 1, 2024 at 11:16 AM Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > On Thu, 28 Mar 2024 09:47:04 -0700 > Yang Shi <shy828301@gmail.com> wrote: > > > We just made some progress regarding multi-sized THP benchmarking on > > the ARM64 platform. So I'd like to propose the topic "Multi-sized THP > > performance benchmark on ARM64" for MM track. > > > > We ran a series of benchmarks on Ampere Altra platform using some > > popular workloads in the cloud: In-memory databases, kernel > > compilation, etc, using different sized huge pages: 2M, 128K, 64K and > > others. > > > > This topic will cover: > > - The benchmark data of some popular workloads in Cloud > > - The performance analysis (where the gain came from, the > > contributing factors, etc) > > - Th recommended page sizes which can achieve overall decent performance gain > > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > What open questions etc is it raising? > > I'm very interested in seeing your results, but maybe this isn't the best path > for them. Thanks for showing interest. Hopefully the benchmark data also can help us for the direction of further optimizations, and I hope this is also a part of the session. > > Jonathan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-01 18:16 ` Jonathan Cameron 2024-04-02 20:04 ` Yang Shi @ 2024-04-04 18:57 ` Christoph Lameter (Ampere) 2024-04-04 19:33 ` David Hildenbrand 2024-04-08 16:30 ` Matthew Wilcox 1 sibling, 2 replies; 16+ messages in thread From: Christoph Lameter (Ampere) @ 2024-04-04 18:57 UTC (permalink / raw) To: Jonathan Cameron Cc: Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On Mon, 1 Apr 2024, Jonathan Cameron wrote: > Sounds like useful data, but is it a suitable topic for LSF-MM? > What open questions etc is it raising? mTHP is new functionality that will require additional work to support more use cases. It is also unclear at this point in which use cases mTHP is useful and where no benefit can be seen so far. Also, the effect of coalescing multiple PTE entries into one TLB entry (CONT_PTE) is new to MM. Ultimately it would be useful to have mTHP support also provide larger blocksize capabilities for filesystems, etc. mTHP needs to mature, and an analysis of the arguably still somewhat experimental state of affairs can help a lot in getting there. ^ permalink raw reply [flat|nested] 16+ messages in thread
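To illustrate the PTE-coalescing point: on arm64 the "contiguous hint" lets 16 naturally aligned 4K PTEs that map one physically contiguous, 64K-aligned block be cached as a single TLB entry. The fragment below is a self-contained toy model of that folding step, not the kernel's contpte implementation; the bit position and constants are assumptions loosely matching the arm64 headers.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define CONT_PTES	16UL			/* with 4K base pages */
#define CONT_SIZE	(CONT_PTES * PAGE_SIZE)	/* one 64K block */
#define PTE_CONT	(1UL << 52)		/* contiguous-hint bit (assumed position) */
#define PTE_ADDR_MASK	0x0000fffffffff000UL

/* Set the hint on a block of 16 entries if they map a 64K-aligned,
 * physically contiguous range. Purely illustrative. */
static void try_fold(unsigned long *ptep)
{
	unsigned long first = ptep[0] & PTE_ADDR_MASK;

	if (first & (CONT_SIZE - 1))
		return;				/* output address not 64K aligned */

	for (unsigned long i = 0; i < CONT_PTES; i++)
		if ((ptep[i] & PTE_ADDR_MASK) != first + i * PAGE_SIZE)
			return;			/* not physically contiguous */

	for (unsigned long i = 0; i < CONT_PTES; i++)
		ptep[i] |= PTE_CONT;		/* hardware may now cache one TLB entry */
}

int main(void)
{
	unsigned long ptes[CONT_PTES];

	for (unsigned long i = 0; i < CONT_PTES; i++)
		ptes[i] = 0x80000000UL + i * PAGE_SIZE;	/* a contiguous 64K block */

	try_fold(ptes);
	printf("contiguous hint %s\n", (ptes[0] & PTE_CONT) ? "set" : "not set");
	return 0;
}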
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 18:57 ` Christoph Lameter (Ampere) @ 2024-04-04 19:33 ` David Hildenbrand 2024-04-09 18:41 ` Yang Shi 2024-04-30 14:41 ` Michal Hocko 0 siblings, 2 replies; 16+ messages in thread From: David Hildenbrand @ 2024-04-04 19:33 UTC (permalink / raw) To: Christoph Lameter (Ampere), Jonathan Cameron Cc: Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > >> Sounds like useful data, but is it a suitable topic for LSF-MM? >> What open questions etc is it raising? > > > mTHP is new functionality that will require additional work to support > more use cases. It is also unclear at this point in what usecases mTHP is > useful and where no benefit can so far be seen. Also the effect of > coalescing multiple PTE entries into one TLB entry is new to MM > (CONT_PTE). > > Ultimately it would be useful to have mTHP support also provide larger > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > analysis of the arguable a bit experimental state of affairs can help a > lot in getting there. Right, something like that (open items, missed use cases, requirements, ideas, etc.) would be a better (good!) fit. Pure benchmark results, analysis and recommendations are great. But likely a better fit for a (white) paper, blog post, or a less-discussion-focused conference. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 19:33 ` David Hildenbrand @ 2024-04-09 18:41 ` Yang Shi 2024-04-09 18:44 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Yang Shi @ 2024-04-09 18:41 UTC (permalink / raw) To: David Hildenbrand Cc: Christoph Lameter (Ampere), Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On Thu, Apr 4, 2024 at 12:34 PM David Hildenbrand <david@redhat.com> wrote: > > On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > >> Sounds like useful data, but is it a suitable topic for LSF-MM? > >> What open questions etc is it raising? > > > > > > mTHP is new functionality that will require additional work to support > > more use cases. It is also unclear at this point in what usecases mTHP is > > useful and where no benefit can so far be seen. Also the effect of > > coalescing multiple PTE entries into one TLB entry is new to MM > > (CONT_PTE). > > > > Ultimately it would be useful to have mTHP support also provide larger > > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > > analysis of the arguable a bit experimental state of affairs can help a > > lot in getting there. > > Right, something like that (open items, missed use cases, requirements, > ideas, etc,.) would be a better (good!) fit. > > Pure benchmark results, analysis and recommendations are great. But > likely a better fit for a (white) paper, blog post, > less-discussion-focused conference. Thanks for the suggestion. I didn't plan to enumerate any open items because I think those items (for example, khugepaged support, swap, etc.) are already well known to the mm community and we have made some progress on some items. The potential future optimization choices led by the benchmark and analysis may be worth discussing. For example, should the allocation fallback try every single order? Is it a good idea to let users decide the orders? And so on. We didn't know what the good choices would be before we had some benchmark data. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 16+ messages in thread
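The fallback question raised above boils down to a walk over the enabled orders at anonymous fault time. The toy sketch below shows the shape of that policy decision; it is a schematic for discussion, not the actual alloc_anon_folio() logic, and the order numbers assume 4K base pages.

#include <stdbool.h>
#include <stdio.h>

/* Pretend allocator: in this toy run nothing larger than 64K is available. */
static bool try_alloc(int order)
{
	return order <= 4;
}

/* Walk the enabled orders from largest to smallest and take the first one
 * that can actually be allocated. The open questions are how far down this
 * walk should go and who chooses the set of enabled orders. */
static int alloc_fallback(unsigned long enabled_orders)
{
	for (int order = 9; order >= 2; order--) {	/* 2M down to 16K */
		if (!(enabled_orders & (1UL << order)))
			continue;
		if (try_alloc(order))
			return order;
	}
	return 0;					/* fall back to a single 4K page */
}

int main(void)
{
	/* e.g. the user enabled 64K (order 4) and 2M (order 9) via sysfs */
	unsigned long enabled = (1UL << 4) | (1UL << 9);

	printf("allocated order %d\n", alloc_fallback(enabled));
	return 0;
}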
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-09 18:41 ` Yang Shi @ 2024-04-09 18:44 ` David Hildenbrand 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand @ 2024-04-09 18:44 UTC (permalink / raw) To: Yang Shi Cc: Christoph Lameter (Ampere), Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On 09.04.24 20:41, Yang Shi wrote: > On Thu, Apr 4, 2024 at 12:34 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: >>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >>> >>>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>>> What open questions etc is it raising? >>> >>> >>> mTHP is new functionality that will require additional work to support >>> more use cases. It is also unclear at this point in what usecases mTHP is >>> useful and where no benefit can so far be seen. Also the effect of >>> coalescing multiple PTE entries into one TLB entry is new to MM >>> (CONT_PTE). >>> >>> Ultimately it would be useful to have mTHP support also provide larger >>> blocksize capabilities for filesystem etc etc. mTHP needs to mature and an >>> analysis of the arguable a bit experimental state of affairs can help a >>> lot in getting there. >> >> Right, something like that (open items, missed use cases, requirements, >> ideas, etc,.) would be a better (good!) fit. >> >> Pure benchmark results, analysis and recommendations are great. But >> likely a better fit for a (white) paper, blog post, >> less-discussion-focused conference. > > Thanks for the suggestion. I didn't plan to enumerate any open items > because I think those items (for example, khugepaged support, swap, > etc) were already well-known by mm community and we have made some > progress on some items. I think there are two types of open items: "we obviously know what we have to do -- basic swap, khugepaged, etc. support" and "we don't really know what to do because it's rather an optimization problem and there might not be a right or wrong". > > The potential future optimization choices led by the benchmark and > analysis may be worth discussing. For example, shall the allocation > fallback should try every single order, is it a good idea to let users > decide the orders, etc. We didn't know what the good choice should be > before we had some benchmark data. Focusing on such open questions makes a lot of sense. Then, you can use the benchmark data to guide the discussion and share your insights :) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 19:33 ` David Hildenbrand 2024-04-09 18:41 ` Yang Shi @ 2024-04-30 14:41 ` Michal Hocko 2024-05-01 16:37 ` Yang Shi 1 sibling, 1 reply; 16+ messages in thread From: Michal Hocko @ 2024-04-30 14:41 UTC (permalink / raw) To: David Hildenbrand Cc: Christoph Lameter (Ampere), Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Dan Williams On Thu 04-04-24 21:33:57, David Hildenbrand wrote: > On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > > > What open questions etc is it raising? > > > > > > mTHP is new functionality that will require additional work to support > > more use cases. It is also unclear at this point in what usecases mTHP is > > useful and where no benefit can so far be seen. Also the effect of > > coalescing multiple PTE entries into one TLB entry is new to MM > > (CONT_PTE). > > > > Ultimately it would be useful to have mTHP support also provide larger > > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > > analysis of the arguable a bit experimental state of affairs can help a > > lot in getting there. > > Right, something like that (open items, missed use cases, requirements, > ideas, etc,.) would be a better (good!) fit. > > Pure benchmark results, analysis and recommendations are great. But likely a > better fit for a (white) paper, blog post, less-discussion-focused > conference. Completely agreed! It would be really great if Yang Shi could open the topic with high level data and then we spent majority of the slot on the actual discussion. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-30 14:41 ` Michal Hocko @ 2024-05-01 16:37 ` Yang Shi 0 siblings, 0 replies; 16+ messages in thread From: Yang Shi @ 2024-05-01 16:37 UTC (permalink / raw) To: Michal Hocko Cc: David Hildenbrand, Christoph Lameter (Ampere), Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Dan Williams On Tue, Apr 30, 2024 at 7:41 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 04-04-24 21:33:57, David Hildenbrand wrote: > > On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > > > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > > > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > > > > What open questions etc is it raising? > > > > > > > > > mTHP is new functionality that will require additional work to support > > > more use cases. It is also unclear at this point in what usecases mTHP is > > > useful and where no benefit can so far be seen. Also the effect of > > > coalescing multiple PTE entries into one TLB entry is new to MM > > > (CONT_PTE). > > > > > > Ultimately it would be useful to have mTHP support also provide larger > > > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > > > analysis of the arguable a bit experimental state of affairs can help a > > > lot in getting there. > > > > Right, something like that (open items, missed use cases, requirements, > > ideas, etc,.) would be a better (good!) fit. > > > > Pure benchmark results, analysis and recommendations are great. But likely a > > better fit for a (white) paper, blog post, less-discussion-focused > > conference. > > Completely agreed! It would be really great if Yang Shi could open the > topic with high level data and then we spent majority of the slot on the > actual discussion. I will try my best to minimize the time spent by explaining benchmark data. It may take 10 - 15 minutes in ballpark estimation. Then the remaining time can be spent in actual discussion. > > Thanks! > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 18:57 ` Christoph Lameter (Ampere) 2024-04-04 19:33 ` David Hildenbrand @ 2024-04-08 16:30 ` Matthew Wilcox 2024-04-08 18:56 ` Zi Yan 1 sibling, 1 reply; 16+ messages in thread From: Matthew Wilcox @ 2024-04-08 16:30 UTC (permalink / raw) To: Christoph Lameter (Ampere) Cc: Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > > What open questions etc is it raising? > > > mTHP is new functionality that will require additional work to support more > use cases. It is also unclear at this point in what usecases mTHP is useful > and where no benefit can so far be seen. Also the effect of coalescing > multiple PTE entries into one TLB entry is new to MM (CONT_PTE). > > Ultimately it would be useful to have mTHP support also provide larger > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > analysis of the arguable a bit experimental state of affairs can help a lot > in getting there. Have you been paying attention to anything that's been happening in Linux development in the last three years? 7b230db3b8d3 introduced folios in December 2020 (was merged in November 2021 for v5.16). v5.17 (March 2022) did everything short of enabling large folios for the page cache, which landed in v5.18 (May 2022). We started using cont-PTEs for large folios in August 2023. Again, the page cache led the way here and we're just adding support for anonymous large folios (called mTHP) now. There's still a ton of work to do, but we've been busy doing it since LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very first result from the group of interested developers. And if you haven't seen the results that Ryan Roberts has posted for the tests he's run, I suggest you look them up. He does a great job of breaking down how much benefit he sees from the hardware side (use of contPTE) vs the software side (shorter LRU lists, fewer atomic ops). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-08 16:30 ` Matthew Wilcox @ 2024-04-08 18:56 ` Zi Yan 2024-04-09 10:47 ` Ryan Roberts 0 siblings, 1 reply; 16+ messages in thread From: Zi Yan @ 2024-04-08 18:56 UTC (permalink / raw) To: Matthew Wilcox, Christoph Lameter Cc: Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Ryan Roberts [-- Attachment #1: Type: text/plain, Size: 3519 bytes --] On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: > On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: >> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >> >>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>> What open questions etc is it raising? >> >> >> mTHP is new functionality that will require additional work to support more >> use cases. It is also unclear at this point in what usecases mTHP is useful >> and where no benefit can so far be seen. Also the effect of coalescing >> multiple PTE entries into one TLB entry is new to MM (CONT_PTE). I think we need a clarification of CONT_PTE from Christoph. From the context of ARM CPUs, CONT_PTE might be a group of PTEs with contiguous bit set. It was used by hugetlb and kernel linear mapping before Ryan added CONT_PTE support for mTHPs. This requires software support (setting contiguous bits) to be able to coalesce PTEs. But ARM also has this Hardware Page Aggregation (HPA) feature[1], which can coalesce PTEs without software intervention. I am not sure which ARM CPUs actually implement it. From the context of all CPUs, AMD has "PTE coalescing/clustering"[2] feature from Zen1. It is similar to ARM's HPA, not requiring software changes to coalesce PTEs. RISC-V also has Svnapot (Naturally-Aligned Power-of-Two Address-Translation Contiguity) [3], which requires software help. So with Matthew's folio patches back in 2020, hardware-only CONT_PTE would work since then. But software-assist CONT_PTE just began to work on ARM CPUs with Ryan's cont-pte patchset for anonymous memory and page cache. >> >> Ultimately it would be useful to have mTHP support also provide larger >> blocksize capabilities for filesystem etc etc. mTHP needs to mature and an >> analysis of the arguable a bit experimental state of affairs can help a lot >> in getting there. > > Have you been paying attention to anything that's been happening in Linux > development in the last three years? 7b230db3b8d3 introduced folios > in December 2020 (was merged in November 2021 for v5.16). v5.17 (March > 2022) did everything short of enabling large folios for the page cache, > which landed in v5.18 (May 2022). We started using cont-PTEs for large > folios in August 2023. Again, the page cache led the way here and we're > just adding support for anonymous large folios (called mTHP) now. Matthew, your cont-PTE here is "New page table range API" right? There is no ARM contiguous bit manipulation, right? > > There's still a ton of work to do, but we've been busy doing it since > LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very > first result from the group of interested developers. > > And if you haven't seen the results that Ryan Roberts has posted for > the tests he's run, I suggest you look them up. He does a great job > of breaking down how much benefit he sees from the hardware side (use of > contPTE) vs the software side (shorter LRU lists, fewer atomic ops). 
It is definitely helpful to distinguish hardware and software benefits, since not all CPUs can coalesce PTEs. [1] https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/cpuectlr-el1--cpu-extended-control-register--el1 [2] https://www.eliot.so/memsys23.pdf [3] https://github.com/riscv/virtual-memory?tab=readme-ov-file#svnapot-naturally-aligned-power-of-two-address-translation-contiguity -- Best Regards, Yan, Zi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 854 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-08 18:56 ` Zi Yan @ 2024-04-09 10:47 ` Ryan Roberts 2024-06-25 11:12 ` Ryan Roberts 0 siblings, 1 reply; 16+ messages in thread From: Ryan Roberts @ 2024-04-09 10:47 UTC (permalink / raw) To: Zi Yan, Matthew Wilcox, Christoph Lameter Cc: Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams [-- Attachment #1: Type: text/plain, Size: 5006 bytes --] Thanks for the CC, Zi! I must admit I'm not great at following the list... On 08/04/2024 19:56, Zi Yan wrote: > On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: > >> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: >>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >>> >>>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>>> What open questions etc is it raising? I'm happy to see others looking at mTHP, and would be very keen to be involved in any discussion. Unfortunately I won't be able to make it to LSFMM this year - my wife is expecting a baby the same week. I'll register for online, but even joining that is looking unlikely. It would be great to be cc'ed on any future results you make public though. And I'd be very happy to work more closely together to debug problems or extend things further - feel free to reach out! I have a roadmap of items that I believe are needed to get this to perform optimally (see first 2 columns of attached slide); only some of this is in mainline so would be good to understand exactly what code you were doing your testing with? >>> >>> >>> mTHP is new functionality that will require additional work to support more >>> use cases. It is also unclear at this point in what usecases mTHP is useful >>> and where no benefit can so far be seen. Also the effect of coalescing >>> multiple PTE entries into one TLB entry is new to MM (CONT_PTE). > > I think we need a clarification of CONT_PTE from Christoph. > > From the context of ARM CPUs, CONT_PTE might be a group of PTEs with contiguous > bit set. It was used by hugetlb and kernel linear mapping before Ryan added > CONT_PTE support for mTHPs. Yes indeed. Note the macro "PTE_CONT" is private to the arm64 arch and is never used directly by the core-mm. It's been around for a while and used for hugetlb and kernel memory. So the only new use is for regular user memory (anon and page cache). So I don't think there are any risks from HW conformance PoV, if that was the concern. > This requires software support (setting contiguous bits) > to be able to coalesce PTEs. But ARM also has this Hardware Page Aggregation (HPA) > feature[1], which can coalesce PTEs without software intervention. I am not > sure which ARM CPUs actually implement it. All of the latest Arm-implemented cores support HPA. However sometimes it needs to be explicitly enabled by EL3 FW. The N1 used in Ampere Altra has it, but it is not propoerly enabled and due to errata, it is not possible to fully enable it even with access to EL3. Thanks, Ryan > > From the context of all CPUs, AMD has "PTE coalescing/clustering"[2] feature > from Zen1. It is similar to ARM's HPA, not requiring software changes to > coalesce PTEs. RISC-V also has Svnapot (Naturally-Aligned Power-of-Two > Address-Translation Contiguity) [3], which requires software help. > > So with Matthew's folio patches back in 2020, hardware-only CONT_PTE > would work since then. 
But software-assist CONT_PTE just began to work > on ARM CPUs with Ryan's cont-pte patchset for anonymous memory and page cache. > >>> >>> Ultimately it would be useful to have mTHP support also provide larger >>> blocksize capabilities for filesystem etc etc. mTHP needs to mature and an >>> analysis of the arguable a bit experimental state of affairs can help a lot >>> in getting there. >> >> Have you been paying attention to anything that's been happening in Linux >> development in the last three years? 7b230db3b8d3 introduced folios >> in December 2020 (was merged in November 2021 for v5.16). v5.17 (March >> 2022) did everything short of enabling large folios for the page cache, >> which landed in v5.18 (May 2022). We started using cont-PTEs for large >> folios in August 2023. Again, the page cache led the way here and we're >> just adding support for anonymous large folios (called mTHP) now. > > Matthew, your cont-PTE here is "New page table range API" right? There is > no ARM contiguous bit manipulation, right? > >> >> There's still a ton of work to do, but we've been busy doing it since >> LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very >> first result from the group of interested developers. >> >> And if you haven't seen the results that Ryan Roberts has posted for >> the tests he's run, I suggest you look them up. He does a great job >> of breaking down how much benefit he sees from the hardware side (use of >> contPTE) vs the software side (shorter LRU lists, fewer atomic ops). > > It is definitely helpful to distinguish hardware and software benefits, > since not all CPUs can coalesce PTEs. > > > [1] https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/cpuectlr-el1--cpu-extended-control-register--el1 > [2] https://www.eliot.so/memsys23.pdf > [3] https://github.com/riscv/virtual-memory?tab=readme-ov-file#svnapot-naturally-aligned-power-of-two-address-translation-contiguity > > -- > Best Regards, > Yan, Zi [-- Attachment #2: folios roadmap.pdf --] [-- Type: application/pdf, Size: 75861 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-09 10:47 ` Ryan Roberts @ 2024-06-25 11:12 ` Ryan Roberts 2024-06-25 18:11 ` Christoph Lameter (Ampere) 2024-06-27 20:54 ` Yang Shi 1 sibling, 2 replies; 16+ messages in thread From: Ryan Roberts @ 2024-06-25 11:12 UTC (permalink / raw) To: Yang Shi Cc: Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Christoph Lameter, Matthew Wilcox, Zi Yan On 09/04/2024 11:47, Ryan Roberts wrote: > Thanks for the CC, Zi! I must admit I'm not great at following the list... > > > On 08/04/2024 19:56, Zi Yan wrote: >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: >> >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >>>> >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>>>> What open questions etc is it raising? > > I'm happy to see others looking at mTHP, and would be very keen to be involved > in any discussion. Unfortunately I won't be able to make it to LSFMM this year - > my wife is expecting a baby the same week. I'll register for online, but even > joining that is looking unlikely. [...] Hi Yang Shi, I finally got around to watching the video of your presentation; thanks for doing the work to benchmark this on your system. I just wanted to raise a couple of points, first on your results and secondly on your conclusions... Results ======= As I'm sure you have seen, I've done some benchmarking with mTHP and contpte, also on an Ampere Altra system. Although my system has 2 NUMA nodes (80 CPUs per node), I've deliberately disabled one of the nodes to avoid noise from cross-socket IO. So the HW should look and behave approximately the same as yours. We have one overlapping benchmark - kernel compilation - and our results are not a million miles apart. You can see my results for 4KPS at [1] (and you can take 16KPS and 64KPS results for reference from [2]). page size | Ryan | Yang Shi ------------|--------|--------- 16K (4KPS) | -6.1% | -5% 16K (16KPS) | -9.2% | -15% 64K (64KPS) | -11.4% | -16% For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested. I'm seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more. I'm not sure why these results diverge so much, perhaps you have an idea? From my side, I've run these benchmarks many many times with successive kernels and revised patches etc, and the numbers are always similar for me. I repeat multiple times across multiple reboots and also disable kaslr and (user) aslr to avoid any unwanted noise/skew. The actual test is essentially: $ make defconfig && time make -s -j80 Image I'd also be interested in how you are measuring memory. I've measured both peak and mean memory (by putting the workload in a cgroup) and see almost double the memory increase that you report for 16KPS. Our measurements for other configs match. But I also want to raise a more general point; We are not done with the optimizations yet. contpte can also improve performance for iTLB, but this requires a change to the page cache to store text in (at least) 64K folios. Typically the iTLB is under a lot of pressure and this can help reduce it. This change is not in mainline yet (and I still need to figure out how to make the patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this will also move the needle on the other benchmarks you ran.
See [3] - I'd appreciate any thoughts you have on how to get something like this accepted. [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ Conclusions =========== I think people in the room already said most of what I want to say; unfortunately there is a trade-off between performance and memory consumption. And it is not always practical to dole out the biggest THP we can allocate; lots of partially used 2M chunks would lead to a lot of wasted memory. So we need a way to let user space configure the kernel for their desired mTHP sizes. In the long term, it would be great to support an "auto" mode, and the current interfaces leave the door open to that. Perhaps your suggestion to start out with 64K and collapse to higher orders is one tool that could take us in that direction. But 64K is arm64-specific. AMD wants 32K. So you still need some mechanism to determine that (and the community wasn't keen on having the arch tell us that). It may actually turn out that we need a more complex interface to allow a (set of) mTHP order(s) to be enabled for a specific VMA. We previously concluded that if/when the time comes, then process_madvise() should give us what we need. That would allow better integration with user space. Your suggestion about splitting higher orders to 64K at swap out is interesting; that might help with some swap fragmentation issues we are currently grappling with. But ultimately splitting a folio is expensive and we want to avoid that cost as much as possible. I'd prefer to continue down the route that Chris Li is taking us down so that we can do a better job of allocating swap in the first place. Thanks, Ryan ^ permalink raw reply [flat|nested] 16+ messages in thread
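For reference, the per-VMA opt-in that exists today is MADV_HUGEPAGE (with process_madvise() as the counterpart for acting on another process); a per-VMA size or order selection as discussed above would be an extension of this, not something the current API provides. A minimal sketch of the existing interface:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL * 1024 * 1024;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Mark the VMA as THP-eligible; with a per-size sysfs knob set to
	 * "madvise", only regions like this become candidates for that size. */
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(buf, 0, len);	/* fault the range in */
	munmap(buf, len);
	return 0;
}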
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-06-25 11:12 ` Ryan Roberts @ 2024-06-25 18:11 ` Christoph Lameter (Ampere) 2024-06-26 10:47 ` Ryan Roberts 2024-06-27 20:54 ` Yang Shi 1 sibling, 1 reply; 16+ messages in thread From: Christoph Lameter (Ampere) @ 2024-06-25 18:11 UTC (permalink / raw) To: Ryan Roberts Cc: Yang Shi, Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Matthew Wilcox, Zi Yan On Tue, 25 Jun 2024, Ryan Roberts wrote: > But I also want to raise a more general point; We are not done with the > optimizations yet. contpte can also improve performance for iTLB, but this > requires a change to the page cache to store text in (at least) 64K folios. > Typically the iTLB is under a lot of pressure and this can help reduce it. This > change is not in mainline yet (and I still need to figure out how to make the > patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this > will also move the needle on the other benchmarks you ran. See [3] - I'd > appreciate any thoughts you have on how to get something like this accepted. > > [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ The discussion here seems to indicate that readahead is already ok for order-2 (16K mTHP size?). So this is only for 64K mTHP on 4K? From what I read in the ARM64 manuals it seems that CONT_PTE can only be used for 64K mTHP on 4K kernels. The 16K case will not benefit from CONT_PTE nor any other intermediate size than 64K. Quoting: https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-/Memory-region-attributes/Long-descriptor-format-memory-region-attributes?lang=en#BEIIBEIJ "Contiguous hint The Long-descriptor translation table format descriptors contain a Contiguous hint bit. Setting this bit to 1 indicates that 16 adjacent translation table entries point to a contiguous output address range. These 16 entries must be aligned in the translation table so that the top 5 bits of their input addresses, that index their position in the translation table, are the same. For example, referring to Figure 12.21, to use this hint for a block of 16 entries in the third-level translation table, bits[20:16] of the input addresses for the 16 entries must be the same. The contiguous output address range must be aligned to size of 16 translation table entries at the same translation table level. Use of this hint means that the TLB can cache a single entry to cover the 16 translation table entries. This bit is only a hint bit. The architecture does not require a processor to cache TLB entries in this way. To avoid TLB coherency issues, any TLB maintenance by address must not assume any optimization of the TLB tables that might result from use of the hint bit. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-06-25 18:11 ` Christoph Lameter (Ampere) @ 2024-06-26 10:47 ` Ryan Roberts 0 siblings, 0 replies; 16+ messages in thread From: Ryan Roberts @ 2024-06-26 10:47 UTC (permalink / raw) To: Christoph Lameter (Ampere) Cc: Yang Shi, Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Matthew Wilcox, Zi Yan On 25/06/2024 19:11, Christoph Lameter (Ampere) wrote: > On Tue, 25 Jun 2024, Ryan Roberts wrote: > >> But I also want to raise a more general point; We are not done with the >> optimizations yet. contpte can also improve performance for iTLB, but this >> requires a change to the page cache to store text in (at least) 64K folios. >> Typically the iTLB is under a lot of pressure and this can help reduce it. This >> change is not in mainline yet (and I still need to figure out how to make the >> patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this >> will also move the needle on the other benchmarks you ran. See [3] - I'd >> appreciate any thoughts you have on how to get something like this accepted. >> >> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ > > The discussion here seems to indicate that readahead is already ok for order-2 > (16K mTHP size?). So this is only for 64K mTHP on 4K? Kind of; for fiflesystems that report support for large folios, readahead starts with order-2 folio, then increments the folio order by 2 orders for every subsequent readahead marker that is hit. But text is rarely accessed sequentially so readahead markers are rarely hit in practice and therefore all the text folios tend to end up as order-2 (16K for 4K base pages). But the important bit is that the filesystem needs to support large folios in the first place, without that, we are always stuck using small (order-0) folios. XFS and a few other (network) filesystems support large folios today, but ext4 doesn't - that's being worked on though. > > From what I read in the ARM64 manuals it seems that CONT_PTE can only be used > for 64K mTHP on 4K kernels. The 16K case will not benefit from CONT_PTE nor any > other intermediate size than 64K. Yes and no. The contiguous hint, when applied, constitutes a single fixed size and that size depends on the base page size. Its 64K for 4KPS, 2M for 16KPS and 2M for 64KPS. However, most modern Arm-designed CPUs support a micro-architectural feature called Hardware Page Aggregation (HPA), which can aggregate up to 4 pages into a single TLB in a way that is transparent to SW. So that feature can benefit from 16K folios when using 4K base pages. Although HPA is implemented in the Neoverse N1 CPU (which is what I believe is in the Ampere Altra), it is disabled and due to an errata can't be enabled. So HPA is not relevant for Altra. > > Quoting: > > https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-/Memory-region-attributes/Long-descriptor-format-memory-region-attributes?lang=en#BEIIBEIJ Note this link is for armv7A, not v8. But hopefully my explanation about answers everything. Thanks, Ryan > > "Contiguous hint > > The Long-descriptor translation table format descriptors contain a Contiguous > hint bit. Setting this bit to 1 indicates that 16 adjacent translation table > entries point to a contiguous output address range. 
These 16 entries must be > aligned in the translation table so that the top 5 bits of their input > addresses, that index their position in the translation table, are the same. For > example, referring to Figure 12.21, to use this hint for a block of 16 entries > in the third-level translation table, bits[20:16] of the input addresses for the > 16 entries must be the same. > > The contiguous output address range must be aligned to size of 16 translation > table entries at the same translation table level. > > Use of this hint means that the TLB can cache a single entry to cover the 16 > translation table entries. > > This bit is only a hint bit. The architecture does not require a processor to > cache TLB entries in this way. To avoid TLB coherency issues, any TLB > maintenance by address must not assume any optimization of the TLB tables that > might result from use of the hint bit. > ^ permalink raw reply [flat|nested] 16+ messages in thread
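To summarise the geometry discussed above (values assumed from the arm64 CONT_PTE definitions; one contiguous-hint entry spans a different block size per base page size):

#include <stdio.h>

int main(void)
{
	/* Number of PTEs covered by one contiguous-hint entry and the resulting
	 * block size, per base page size (assumed to match CONT_PTES/CONT_PTE_SIZE
	 * in the arm64 headers). */
	static const struct { const char *granule; int ptes; const char *block; } g[] = {
		{ "4K",  16,  "64K" },	/* 16 * 4K   = 64K */
		{ "16K", 128, "2M"  },	/* 128 * 16K = 2M  */
		{ "64K", 32,  "2M"  },	/* 32 * 64K  = 2M  */
	};

	for (int i = 0; i < 3; i++)
		printf("%3s base pages: %3d contiguous PTEs -> one %s TLB entry\n",
		       g[i].granule, g[i].ptes, g[i].block);
	return 0;
}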
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-06-25 11:12 ` Ryan Roberts 2024-06-25 18:11 ` Christoph Lameter (Ampere) @ 2024-06-27 20:54 ` Yang Shi 1 sibling, 0 replies; 16+ messages in thread From: Yang Shi @ 2024-06-27 20:54 UTC (permalink / raw) To: Ryan Roberts Cc: Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Christoph Lameter, Matthew Wilcox, Zi Yan On Tue, Jun 25, 2024 at 4:12 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 09/04/2024 11:47, Ryan Roberts wrote: > > Thanks for the CC, Zi! I must admit I'm not great at following the list... > > > > > > On 08/04/2024 19:56, Zi Yan wrote: > >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: > >> > >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: > >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: > >>>> > >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM? > >>>>> What open questions etc is it raising? > > > > I'm happy to see others looking at mTHP, and would be very keen to be involved > > in any discussion. Unfortunately I won't be able to make it to LSFMM this year - > > my wife is expecting a baby the same week. I'll register for online, but even > > joining that is looking unlikely. > > [...] > > Hi Yang Shi, > > I finally got around to watching the video of your presentation; Thanks for > doing the work to benchmark this on your system. > > I just wanted to raise a couple of points, first on your results and secondly on > your conclusions... Thanks for following up. Sorry for the late reply, I just came back from a 2 week vacation and still suffered from jet lag... > > Results > ======= > > As I'm sure you have seen, I've done some benchmarking with mTHP and contpte, > also on an Ampere Altra system. Although my system has 2 NUMA nodes (80 CPUs per > node), I've deliberately disabled one of the nodes to avoid noise from cross > socket IO. So the HW should look and behave approximately the same as yours. I used 1 socket system, but 128 cores per node. I used taskset to bind kernel build tasks on core 10 - 89. > > We have one overlapping benchmark - kernel compilation - and our results are not > a million miles apart. You can see my results for 4KPS at [1] (and you can take > 16KPS and 64KPS results for reference from [2]). > > page size | Ryan | Yang Shi > ------------|--------|--------- > 16K (4KPS) | -6.1% | -5% > 16K (16KPS) | -9.2% | -15% > 64K (64KPS) | -11.4% | -16% > > For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested. I'm > seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more. I'm not > sure why these results diverge so much, perhaps you have an idea? From my side, > I've run these benchmarks many many times with successive kernels and revised > patches etc, and the numbers are always similar for me. I repeat multiple times > across multiple reboots and also disable kaslr and (user) aslr to avoid any > unwanted noise/skew. > > The actual test is essentially: > > $ make defconfig && time make –s –j80 Image I'm not sure whether the config may make some difference or not. I used the default Fedora config. And I'm running my test on Fedora 39 with gcc (GCC) 13.2.1 20230918. I saw you were using ubuntu 22.04. Not sure whether this is correlated or not. And Matthew said he didn't see any number close to our number (I can't remember what exactly he said, but he should mean it) in the discussion. I'm not sure what number Matthew meant, or he meant your number? 
> > I'd also be interested in how you are measuring memory. I've measured both peak > and mean memory (by putting the workload in a cgroup) and see almost double the > memory increase that you report for 16KPS. Our measurements for other configs match. I also used memory.peak to measure the memory consumption. I didn't try different configs. I just noticed more cores may incur more memory consumption. It is more noticeable with 64KPS. > > But I also want to raise a more general point; We are not done with the > optimizations yet. contpte can also improve performance for iTLB, but this > requires a change to the page cache to store text in (at least) 64K folios. > Typically the iTLB is under a lot of pressure and this can help reduce it. This > change is not in mainline yet (and I still need to figure out how to make the > patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this > will also move the needle on the other benchmarks you ran. See [3] - I'd > appreciate any thoughts you have on how to get something like this accepted. AFAIK, the improvement from reduced iTLB really depends on workloads. IIRC, MySQL is more sensitive to it. We did some tests with CONFIG_READ_ONLY_THP_FOR_FS enabled for MySQL, we saw decent improvement, but I really don't remember the exact number. > > [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ > [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ > > Conclusions > =========== > > I think people in the room already said most of what I want to say; > Unfortunately there is a trade-off between performance and memory consumption. > And it is not always practical to dole out the biggest THP we can allocate; lots > of partially used 2M chunks would lead to a lot of wasted memory. So we need a > way to let user space configure the kernel for their desired mTHP sizes. > > In the long term, it would be great to support an "auto" mode, and the current > interfaces leave the door open to that. Perhaps your suggestion to start out > with 64K and collapse to higher orders is one tool that could take us in that > direction. But 64K is arm64-specific. AMD wants 32K. So you still need some > mechanism to determine that (and the community wasn't keen on having the arch > tell us that). > > It may actually turn out that we need a more complex interface to allow a (set > of) mTHP order(s) to be enabled for a specific VMA. We previously concluded that > if/when the time comes, then madvise_process() should give us what we need. That > would allow better integration with user space. The internal fragmentation or memory waste for 2M THP is a chronic problem. The medium sized THP can help tackle this, but the performance may not be as good as 2M THP. So after the discussion I was actually thinking that we may need two policies based on the workloads since there seems to be no one policy that works for everyone. One for max TLB utilization improvement, the other for memory conservative. For example, the workload which doesn't care too much about memory waste, they can choose to allocate THP from the biggest suitable order, for example, 2M for some VM workloads. On the other side of the spectrum, we can start allocating from smaller order then collapse to larger order. The system can have a default policy, the users can change the policy by calling some interfaces, for example, madvise(). 
Anyway, just off the top of my head, I haven't invested too much time in this aspect yet. I don't think 64K vs 32K is a problem. The two 32K chunks in the same 64K chunk are properly aligned. 64K is not a very high order, so starting from 64K for everyone should not be a problem. I don't see why we have to care about this. With all the means mentioned above, we may be able to achieve a full "auto" mode in the future. Actually, another problem with the current interface is that we may end up having the same behavior with different settings. For example, having "inherit" for all orders and "always" for the top-level knob may behave the same as having all orders and the top-level knob set to "always". This may result in some confusion and violate the rule for sysfs interfaces. > > Your suggestion about splitting higher orders to 64K at swap out is interesting; > that might help with some swap fragmentation issues we are currently grappling > with. But ultimately splitting a folio is expensive and we want to avoid that > cost as much as possible. I'd prefer to continue down the route that Chris Li is > taking us so that we can do a better job of allocating swap in the first place. I think I meant splitting to 64K when we have to split. I don't mean we split to 64K all the time. If we run into swap fragmentation, splitting to a smaller order may help reduce premature OOMs, and the cost of splitting may be worth it. Just like what we do for other paths, for example, page demotion, migration, etc., we split the large folio if there is not enough memory. I may not have articulated this clearly in the slides and the discussion, sorry for the confusion. If we have a better way to tackle the swap fragmentation without splitting, that is definitely preferable. > > Thanks, > Ryan > ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-06-27 20:54 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-03-28 16:47 [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 Yang Shi 2024-04-01 18:16 ` Jonathan Cameron 2024-04-02 20:04 ` Yang Shi 2024-04-04 18:57 ` Christoph Lameter (Ampere) 2024-04-04 19:33 ` David Hildenbrand 2024-04-09 18:41 ` Yang Shi 2024-04-09 18:44 ` David Hildenbrand 2024-04-30 14:41 ` Michal Hocko 2024-05-01 16:37 ` Yang Shi 2024-04-08 16:30 ` Matthew Wilcox 2024-04-08 18:56 ` Zi Yan 2024-04-09 10:47 ` Ryan Roberts 2024-06-25 11:12 ` Ryan Roberts 2024-06-25 18:11 ` Christoph Lameter (Ampere) 2024-06-26 10:47 ` Ryan Roberts 2024-06-27 20:54 ` Yang Shi