* [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 @ 2024-03-28 16:47 Yang Shi 2024-04-01 18:16 ` Jonathan Cameron 0 siblings, 1 reply; 16+ messages in thread From: Yang Shi @ 2024-03-28 16:47 UTC (permalink / raw) To: lsf-pc, olivier.singla, Christoph Lameter (Ampere) Cc: Linux MM, Michal Hocko, Dan Williams We just made some progress regarding multi-sized THP benchmarking on the ARM64 platform. So I'd like to propose the topic "Multi-sized THP performance benchmark on ARM64" for the MM track. We ran a series of benchmarks on an Ampere Altra platform with some popular workloads in the cloud: in-memory databases, kernel compilation, etc., using different huge page sizes: 2M, 128K, 64K and others. This topic will cover: - The benchmark data of some popular workloads in the cloud - The performance analysis (where the gain came from, the contributing factors, etc.) - The recommended page sizes which can achieve a decent overall performance gain ^ permalink raw reply [flat|nested] 16+ messages in thread
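For context on how the individual sizes mentioned above (2M, 128K, 64K) are selected at run time: recent kernels expose a per-size knob under /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled. The snippet below is only a minimal sketch of how a benchmark run might enable one size and disable the others; it assumes the per-size sysfs layout introduced in v6.8, 4K base pages and root privileges, and the exact set of available sizes depends on the kernel configuration.

/*
 * Minimal sketch: enable only the 64K mTHP size for a benchmark run by
 * writing the per-size sysfs knobs. The set of hugepages-<size>kB
 * directories and their default values depend on kernel version and base
 * page size, so treat the paths below as assumptions, not a stable API.
 */
#include <stdio.h>
#include <string.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Valid values for each knob: always, inherit, madvise, never. */
	const char *sizes[] = { "2048kB", "128kB", "64kB" };
	char path[128];

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/hugepages-%s/enabled",
			 sizes[i]);
		write_knob(path, strcmp(sizes[i], "64kB") == 0 ? "always" : "never");
	}
	return 0;
}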
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-03-28 16:47 [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 Yang Shi @ 2024-04-01 18:16 ` Jonathan Cameron 2024-04-02 20:04 ` Yang Shi 2024-04-04 18:57 ` Christoph Lameter (Ampere) 0 siblings, 2 replies; 16+ messages in thread From: Jonathan Cameron @ 2024-04-01 18:16 UTC (permalink / raw) To: Yang Shi Cc: lsf-pc, olivier.singla, Christoph Lameter (Ampere), Linux MM, Michal Hocko, Dan Williams On Thu, 28 Mar 2024 09:47:04 -0700 Yang Shi <shy828301@gmail.com> wrote: > We just made some progress regarding multi-sized THP benchmarking on > the ARM64 platform. So I'd like to propose the topic "Multi-sized THP > performance benchmark on ARM64" for MM track. > > We ran a series of benchmarks on Ampere Altra platform using some > popular workloads in the cloud: In-memory databases, kernel > compilation, etc, using different sized huge pages: 2M, 128K, 64K and > others. > > This topic will cover: > - The benchmark data of some popular workloads in Cloud > - The performance analysis (where the gain came from, the > contributing factors, etc) > - Th recommended page sizes which can achieve overall decent performance gain > Sounds like useful data, but is it a suitable topic for LSF-MM? What open questions etc is it raising? I'm very interested in seeing your results, but maybe this isn't the best path for them. Jonathan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-01 18:16 ` Jonathan Cameron @ 2024-04-02 20:04 ` Yang Shi 2024-04-04 18:57 ` Christoph Lameter (Ampere) 1 sibling, 0 replies; 16+ messages in thread From: Yang Shi @ 2024-04-02 20:04 UTC (permalink / raw) To: Jonathan Cameron Cc: lsf-pc, olivier.singla, Christoph Lameter (Ampere), Linux MM, Michal Hocko, Dan Williams On Mon, Apr 1, 2024 at 11:16 AM Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > On Thu, 28 Mar 2024 09:47:04 -0700 > Yang Shi <shy828301@gmail.com> wrote: > > > We just made some progress regarding multi-sized THP benchmarking on > > the ARM64 platform. So I'd like to propose the topic "Multi-sized THP > > performance benchmark on ARM64" for MM track. > > > > We ran a series of benchmarks on Ampere Altra platform using some > > popular workloads in the cloud: In-memory databases, kernel > > compilation, etc, using different sized huge pages: 2M, 128K, 64K and > > others. > > > > This topic will cover: > > - The benchmark data of some popular workloads in Cloud > > - The performance analysis (where the gain came from, the > > contributing factors, etc) > > - Th recommended page sizes which can achieve overall decent performance gain > > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > What open questions etc is it raising? > > I'm very interested in seeing your results, but maybe this isn't the best path > for them. Thanks for showing interest. Hopefully the benchmark data also can help us for the direction of further optimizations, and I hope this is also a part of the session. > > Jonathan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-01 18:16 ` Jonathan Cameron 2024-04-02 20:04 ` Yang Shi @ 2024-04-04 18:57 ` Christoph Lameter (Ampere) 2024-04-04 19:33 ` David Hildenbrand 2024-04-08 16:30 ` Matthew Wilcox 1 sibling, 2 replies; 16+ messages in thread From: Christoph Lameter (Ampere) @ 2024-04-04 18:57 UTC (permalink / raw) To: Jonathan Cameron Cc: Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On Mon, 1 Apr 2024, Jonathan Cameron wrote: > Sounds like useful data, but is it a suitable topic for LSF-MM? > What open questions etc is it raising? mTHP is new functionality that will require additional work to support more use cases. It is also unclear at this point in which use cases mTHP is useful and where no benefit can be seen so far. Also, the effect of coalescing multiple PTE entries into one TLB entry (CONT_PTE) is new to MM. Ultimately it would be useful to have mTHP support also provide larger blocksize capabilities for filesystems, etc. mTHP needs to mature, and an analysis of the arguably still somewhat experimental state of affairs can help a lot in getting there. ^ permalink raw reply [flat|nested] 16+ messages in thread
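To illustrate the PTE-coalescing point: on arm64 the "contiguous hint" lets 16 naturally aligned 4K PTEs that map one physically contiguous, 64K-aligned block be cached as a single TLB entry. The fragment below is a self-contained toy model of that folding step, not the kernel's contpte implementation; the bit position and constants are assumptions loosely matching the arm64 headers.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define CONT_PTES	16UL			/* with 4K base pages */
#define CONT_SIZE	(CONT_PTES * PAGE_SIZE)	/* one 64K block */
#define PTE_CONT	(1UL << 52)		/* contiguous-hint bit (assumed position) */
#define PTE_ADDR_MASK	0x0000fffffffff000UL

/* Set the hint on a block of 16 entries if they map a 64K-aligned,
 * physically contiguous range. Purely illustrative. */
static void try_fold(unsigned long *ptep)
{
	unsigned long first = ptep[0] & PTE_ADDR_MASK;

	if (first & (CONT_SIZE - 1))
		return;				/* output address not 64K aligned */

	for (unsigned long i = 0; i < CONT_PTES; i++)
		if ((ptep[i] & PTE_ADDR_MASK) != first + i * PAGE_SIZE)
			return;			/* not physically contiguous */

	for (unsigned long i = 0; i < CONT_PTES; i++)
		ptep[i] |= PTE_CONT;		/* hardware may now cache one TLB entry */
}

int main(void)
{
	unsigned long ptes[CONT_PTES];

	for (unsigned long i = 0; i < CONT_PTES; i++)
		ptes[i] = 0x80000000UL + i * PAGE_SIZE;	/* a contiguous 64K block */

	try_fold(ptes);
	printf("contiguous hint %s\n", (ptes[0] & PTE_CONT) ? "set" : "not set");
	return 0;
}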
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 18:57 ` Christoph Lameter (Ampere) @ 2024-04-04 19:33 ` David Hildenbrand 2024-04-09 18:41 ` Yang Shi 2024-04-30 14:41 ` Michal Hocko 0 siblings, 2 replies; 16+ messages in thread From: David Hildenbrand @ 2024-04-04 19:33 UTC (permalink / raw) To: Christoph Lameter (Ampere), Jonathan Cameron Cc: Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > >> Sounds like useful data, but is it a suitable topic for LSF-MM? >> What open questions etc is it raising? > > > mTHP is new functionality that will require additional work to support > more use cases. It is also unclear at this point in what usecases mTHP is > useful and where no benefit can so far be seen. Also the effect of > coalescing multiple PTE entries into one TLB entry is new to MM > (CONT_PTE). > > Ultimately it would be useful to have mTHP support also provide larger > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > analysis of the arguable a bit experimental state of affairs can help a > lot in getting there. Right, something like that (open items, missed use cases, requirements, ideas, etc.) would be a better (good!) fit. Pure benchmark results, analysis and recommendations are great. But likely a better fit for a (white) paper, blog post, or a less-discussion-focused conference. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 19:33 ` David Hildenbrand @ 2024-04-09 18:41 ` Yang Shi 2024-04-09 18:44 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Yang Shi @ 2024-04-09 18:41 UTC (permalink / raw) To: David Hildenbrand Cc: Christoph Lameter (Ampere), Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On Thu, Apr 4, 2024 at 12:34 PM David Hildenbrand <david@redhat.com> wrote: > > On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > >> Sounds like useful data, but is it a suitable topic for LSF-MM? > >> What open questions etc is it raising? > > > > > > mTHP is new functionality that will require additional work to support > > more use cases. It is also unclear at this point in what usecases mTHP is > > useful and where no benefit can so far be seen. Also the effect of > > coalescing multiple PTE entries into one TLB entry is new to MM > > (CONT_PTE). > > > > Ultimately it would be useful to have mTHP support also provide larger > > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > > analysis of the arguable a bit experimental state of affairs can help a > > lot in getting there. > > Right, something like that (open items, missed use cases, requirements, > ideas, etc,.) would be a better (good!) fit. > > Pure benchmark results, analysis and recommendations are great. But > likely a better fit for a (white) paper, blog post, > less-discussion-focused conference. Thanks for the suggestion. I didn't plan to enumerate any open items because I think those items (for example, khugepaged support, swap, etc.) are already well known to the mm community and we have made some progress on some items. The potential future optimization choices led by the benchmark and analysis may be worth discussing. For example, should the allocation fallback try every single order? Is it a good idea to let users decide the orders? And so on. We didn't know what the good choices would be before we had some benchmark data. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 16+ messages in thread
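The fallback question raised above boils down to a walk over the enabled orders at anonymous fault time. The toy sketch below shows the shape of that policy decision; it is a schematic for discussion, not the actual alloc_anon_folio() logic, and the order numbers assume 4K base pages.

#include <stdbool.h>
#include <stdio.h>

/* Pretend allocator: in this toy run nothing larger than 64K is available. */
static bool try_alloc(int order)
{
	return order <= 4;
}

/* Walk the enabled orders from largest to smallest and take the first one
 * that can actually be allocated. The open questions are how far down this
 * walk should go and who chooses the set of enabled orders. */
static int alloc_fallback(unsigned long enabled_orders)
{
	for (int order = 9; order >= 2; order--) {	/* 2M down to 16K */
		if (!(enabled_orders & (1UL << order)))
			continue;
		if (try_alloc(order))
			return order;
	}
	return 0;					/* fall back to a single 4K page */
}

int main(void)
{
	/* e.g. the user enabled 64K (order 4) and 2M (order 9) via sysfs */
	unsigned long enabled = (1UL << 4) | (1UL << 9);

	printf("allocated order %d\n", alloc_fallback(enabled));
	return 0;
}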
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-09 18:41 ` Yang Shi @ 2024-04-09 18:44 ` David Hildenbrand 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand @ 2024-04-09 18:44 UTC (permalink / raw) To: Yang Shi Cc: Christoph Lameter (Ampere), Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On 09.04.24 20:41, Yang Shi wrote: > On Thu, Apr 4, 2024 at 12:34 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: >>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >>> >>>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>>> What open questions etc is it raising? >>> >>> >>> mTHP is new functionality that will require additional work to support >>> more use cases. It is also unclear at this point in what usecases mTHP is >>> useful and where no benefit can so far be seen. Also the effect of >>> coalescing multiple PTE entries into one TLB entry is new to MM >>> (CONT_PTE). >>> >>> Ultimately it would be useful to have mTHP support also provide larger >>> blocksize capabilities for filesystem etc etc. mTHP needs to mature and an >>> analysis of the arguable a bit experimental state of affairs can help a >>> lot in getting there. >> >> Right, something like that (open items, missed use cases, requirements, >> ideas, etc,.) would be a better (good!) fit. >> >> Pure benchmark results, analysis and recommendations are great. But >> likely a better fit for a (white) paper, blog post, >> less-discussion-focused conference. > > Thanks for the suggestion. I didn't plan to enumerate any open items > because I think those items (for example, khugepaged support, swap, > etc) were already well-known by mm community and we have made some > progress on some items. I think there are two types of open items: "we obviously know what we have to do -- basic swap, khugepaged, etc. support" and "we don't really know what to do because it's rather an optimization problem and there might not be a right or wrong". > > The potential future optimization choices led by the benchmark and > analysis may be worth discussing. For example, shall the allocation > fallback should try every single order, is it a good idea to let users > decide the orders, etc. We didn't know what the good choice should be > before we had some benchmark data. Focusing on such open questions makes a lot of sense. Then, you can use the benchmark data to guide the discussion and share your insights :) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 19:33 ` David Hildenbrand 2024-04-09 18:41 ` Yang Shi @ 2024-04-30 14:41 ` Michal Hocko 2024-05-01 16:37 ` Yang Shi 1 sibling, 1 reply; 16+ messages in thread From: Michal Hocko @ 2024-04-30 14:41 UTC (permalink / raw) To: David Hildenbrand Cc: Christoph Lameter (Ampere), Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Dan Williams On Thu 04-04-24 21:33:57, David Hildenbrand wrote: > On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > > > What open questions etc is it raising? > > > > > > mTHP is new functionality that will require additional work to support > > more use cases. It is also unclear at this point in what usecases mTHP is > > useful and where no benefit can so far be seen. Also the effect of > > coalescing multiple PTE entries into one TLB entry is new to MM > > (CONT_PTE). > > > > Ultimately it would be useful to have mTHP support also provide larger > > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > > analysis of the arguable a bit experimental state of affairs can help a > > lot in getting there. > > Right, something like that (open items, missed use cases, requirements, > ideas, etc,.) would be a better (good!) fit. > > Pure benchmark results, analysis and recommendations are great. But likely a > better fit for a (white) paper, blog post, less-discussion-focused > conference. Completely agreed! It would be really great if Yang Shi could open the topic with high level data and then we spent majority of the slot on the actual discussion. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-30 14:41 ` Michal Hocko @ 2024-05-01 16:37 ` Yang Shi 0 siblings, 0 replies; 16+ messages in thread From: Yang Shi @ 2024-05-01 16:37 UTC (permalink / raw) To: Michal Hocko Cc: David Hildenbrand, Christoph Lameter (Ampere), Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Dan Williams On Tue, Apr 30, 2024 at 7:41 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 04-04-24 21:33:57, David Hildenbrand wrote: > > On 04.04.24 20:57, Christoph Lameter (Ampere) wrote: > > > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > > > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > > > > What open questions etc is it raising? > > > > > > > > > mTHP is new functionality that will require additional work to support > > > more use cases. It is also unclear at this point in what usecases mTHP is > > > useful and where no benefit can so far be seen. Also the effect of > > > coalescing multiple PTE entries into one TLB entry is new to MM > > > (CONT_PTE). > > > > > > Ultimately it would be useful to have mTHP support also provide larger > > > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > > > analysis of the arguable a bit experimental state of affairs can help a > > > lot in getting there. > > > > Right, something like that (open items, missed use cases, requirements, > > ideas, etc,.) would be a better (good!) fit. > > > > Pure benchmark results, analysis and recommendations are great. But likely a > > better fit for a (white) paper, blog post, less-discussion-focused > > conference. > > Completely agreed! It would be really great if Yang Shi could open the > topic with high level data and then we spent majority of the slot on the > actual discussion. I will try my best to minimize the time spent by explaining benchmark data. It may take 10 - 15 minutes in ballpark estimation. Then the remaining time can be spent in actual discussion. > > Thanks! > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-04 18:57 ` Christoph Lameter (Ampere) 2024-04-04 19:33 ` David Hildenbrand @ 2024-04-08 16:30 ` Matthew Wilcox 2024-04-08 18:56 ` Zi Yan 1 sibling, 1 reply; 16+ messages in thread From: Matthew Wilcox @ 2024-04-08 16:30 UTC (permalink / raw) To: Christoph Lameter (Ampere) Cc: Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: > On Mon, 1 Apr 2024, Jonathan Cameron wrote: > > > Sounds like useful data, but is it a suitable topic for LSF-MM? > > What open questions etc is it raising? > > > mTHP is new functionality that will require additional work to support more > use cases. It is also unclear at this point in what usecases mTHP is useful > and where no benefit can so far be seen. Also the effect of coalescing > multiple PTE entries into one TLB entry is new to MM (CONT_PTE). > > Ultimately it would be useful to have mTHP support also provide larger > blocksize capabilities for filesystem etc etc. mTHP needs to mature and an > analysis of the arguable a bit experimental state of affairs can help a lot > in getting there. Have you been paying attention to anything that's been happening in Linux development in the last three years? 7b230db3b8d3 introduced folios in December 2020 (was merged in November 2021 for v5.16). v5.17 (March 2022) did everything short of enabling large folios for the page cache, which landed in v5.18 (May 2022). We started using cont-PTEs for large folios in August 2023. Again, the page cache led the way here and we're just adding support for anonymous large folios (called mTHP) now. There's still a ton of work to do, but we've been busy doing it since LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very first result from the group of interested developers. And if you haven't seen the results that Ryan Roberts has posted for the tests he's run, I suggest you look them up. He does a great job of breaking down how much benefit he sees from the hardware side (use of contPTE) vs the software side (shorter LRU lists, fewer atomic ops). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-08 16:30 ` Matthew Wilcox @ 2024-04-08 18:56 ` Zi Yan 2024-04-09 10:47 ` Ryan Roberts 0 siblings, 1 reply; 16+ messages in thread From: Zi Yan @ 2024-04-08 18:56 UTC (permalink / raw) To: Matthew Wilcox, Christoph Lameter Cc: Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Ryan Roberts [-- Attachment #1: Type: text/plain, Size: 3519 bytes --] On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: > On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: >> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >> >>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>> What open questions etc is it raising? >> >> >> mTHP is new functionality that will require additional work to support more >> use cases. It is also unclear at this point in what usecases mTHP is useful >> and where no benefit can so far be seen. Also the effect of coalescing >> multiple PTE entries into one TLB entry is new to MM (CONT_PTE). I think we need a clarification of CONT_PTE from Christoph. From the context of ARM CPUs, CONT_PTE might be a group of PTEs with contiguous bit set. It was used by hugetlb and kernel linear mapping before Ryan added CONT_PTE support for mTHPs. This requires software support (setting contiguous bits) to be able to coalesce PTEs. But ARM also has this Hardware Page Aggregation (HPA) feature[1], which can coalesce PTEs without software intervention. I am not sure which ARM CPUs actually implement it. From the context of all CPUs, AMD has "PTE coalescing/clustering"[2] feature from Zen1. It is similar to ARM's HPA, not requiring software changes to coalesce PTEs. RISC-V also has Svnapot (Naturally-Aligned Power-of-Two Address-Translation Contiguity) [3], which requires software help. So with Matthew's folio patches back in 2020, hardware-only CONT_PTE would work since then. But software-assist CONT_PTE just began to work on ARM CPUs with Ryan's cont-pte patchset for anonymous memory and page cache. >> >> Ultimately it would be useful to have mTHP support also provide larger >> blocksize capabilities for filesystem etc etc. mTHP needs to mature and an >> analysis of the arguable a bit experimental state of affairs can help a lot >> in getting there. > > Have you been paying attention to anything that's been happening in Linux > development in the last three years? 7b230db3b8d3 introduced folios > in December 2020 (was merged in November 2021 for v5.16). v5.17 (March > 2022) did everything short of enabling large folios for the page cache, > which landed in v5.18 (May 2022). We started using cont-PTEs for large > folios in August 2023. Again, the page cache led the way here and we're > just adding support for anonymous large folios (called mTHP) now. Matthew, your cont-PTE here is "New page table range API" right? There is no ARM contiguous bit manipulation, right? > > There's still a ton of work to do, but we've been busy doing it since > LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very > first result from the group of interested developers. > > And if you haven't seen the results that Ryan Roberts has posted for > the tests he's run, I suggest you look them up. He does a great job > of breaking down how much benefit he sees from the hardware side (use of > contPTE) vs the software side (shorter LRU lists, fewer atomic ops). 
It is definitely helpful to distinguish hardware and software benefits, since not all CPUs can coalesce PTEs. [1] https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/cpuectlr-el1--cpu-extended-control-register--el1 [2] https://www.eliot.so/memsys23.pdf [3] https://github.com/riscv/virtual-memory?tab=readme-ov-file#svnapot-naturally-aligned-power-of-two-address-translation-contiguity -- Best Regards, Yan, Zi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 854 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-08 18:56 ` Zi Yan @ 2024-04-09 10:47 ` Ryan Roberts 2024-06-25 11:12 ` Ryan Roberts 0 siblings, 1 reply; 16+ messages in thread From: Ryan Roberts @ 2024-04-09 10:47 UTC (permalink / raw) To: Zi Yan, Matthew Wilcox, Christoph Lameter Cc: Jonathan Cameron, Yang Shi, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams [-- Attachment #1: Type: text/plain, Size: 5006 bytes --] Thanks for the CC, Zi! I must admit I'm not great at following the list... On 08/04/2024 19:56, Zi Yan wrote: > On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: > >> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: >>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >>> >>>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>>> What open questions etc is it raising? I'm happy to see others looking at mTHP, and would be very keen to be involved in any discussion. Unfortunately I won't be able to make it to LSFMM this year - my wife is expecting a baby the same week. I'll register for online, but even joining that is looking unlikely. It would be great to be cc'ed on any future results you make public though. And I'd be very happy to work more closely together to debug problems or extend things further - feel free to reach out! I have a roadmap of items that I believe are needed to get this to perform optimally (see first 2 columns of attached slide); only some of this is in mainline so would be good to understand exactly what code you were doing your testing with? >>> >>> >>> mTHP is new functionality that will require additional work to support more >>> use cases. It is also unclear at this point in what usecases mTHP is useful >>> and where no benefit can so far be seen. Also the effect of coalescing >>> multiple PTE entries into one TLB entry is new to MM (CONT_PTE). > > I think we need a clarification of CONT_PTE from Christoph. > > From the context of ARM CPUs, CONT_PTE might be a group of PTEs with contiguous > bit set. It was used by hugetlb and kernel linear mapping before Ryan added > CONT_PTE support for mTHPs. Yes indeed. Note the macro "PTE_CONT" is private to the arm64 arch and is never used directly by the core-mm. It's been around for a while and used for hugetlb and kernel memory. So the only new use is for regular user memory (anon and page cache). So I don't think there are any risks from HW conformance PoV, if that was the concern. > This requires software support (setting contiguous bits) > to be able to coalesce PTEs. But ARM also has this Hardware Page Aggregation (HPA) > feature[1], which can coalesce PTEs without software intervention. I am not > sure which ARM CPUs actually implement it. All of the latest Arm-implemented cores support HPA. However sometimes it needs to be explicitly enabled by EL3 FW. The N1 used in Ampere Altra has it, but it is not propoerly enabled and due to errata, it is not possible to fully enable it even with access to EL3. Thanks, Ryan > > From the context of all CPUs, AMD has "PTE coalescing/clustering"[2] feature > from Zen1. It is similar to ARM's HPA, not requiring software changes to > coalesce PTEs. RISC-V also has Svnapot (Naturally-Aligned Power-of-Two > Address-Translation Contiguity) [3], which requires software help. > > So with Matthew's folio patches back in 2020, hardware-only CONT_PTE > would work since then. 
But software-assist CONT_PTE just began to work > on ARM CPUs with Ryan's cont-pte patchset for anonymous memory and page cache. > >>> >>> Ultimately it would be useful to have mTHP support also provide larger >>> blocksize capabilities for filesystem etc etc. mTHP needs to mature and an >>> analysis of the arguable a bit experimental state of affairs can help a lot >>> in getting there. >> >> Have you been paying attention to anything that's been happening in Linux >> development in the last three years? 7b230db3b8d3 introduced folios >> in December 2020 (was merged in November 2021 for v5.16). v5.17 (March >> 2022) did everything short of enabling large folios for the page cache, >> which landed in v5.18 (May 2022). We started using cont-PTEs for large >> folios in August 2023. Again, the page cache led the way here and we're >> just adding support for anonymous large folios (called mTHP) now. > > Matthew, your cont-PTE here is "New page table range API" right? There is > no ARM contiguous bit manipulation, right? > >> >> There's still a ton of work to do, but we've been busy doing it since >> LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very >> first result from the group of interested developers. >> >> And if you haven't seen the results that Ryan Roberts has posted for >> the tests he's run, I suggest you look them up. He does a great job >> of breaking down how much benefit he sees from the hardware side (use of >> contPTE) vs the software side (shorter LRU lists, fewer atomic ops). > > It is definitely helpful to distinguish hardware and software benefits, > since not all CPUs can coalesce PTEs. > > > [1] https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/cpuectlr-el1--cpu-extended-control-register--el1 > [2] https://www.eliot.so/memsys23.pdf > [3] https://github.com/riscv/virtual-memory?tab=readme-ov-file#svnapot-naturally-aligned-power-of-two-address-translation-contiguity > > -- > Best Regards, > Yan, Zi [-- Attachment #2: folios roadmap.pdf --] [-- Type: application/pdf, Size: 75861 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-04-09 10:47 ` Ryan Roberts @ 2024-06-25 11:12 ` Ryan Roberts 2024-06-25 18:11 ` Christoph Lameter (Ampere) 2024-06-27 20:54 ` Yang Shi 1 sibling, 2 replies; 16+ messages in thread From: Ryan Roberts @ 2024-06-25 11:12 UTC (permalink / raw) To: Yang Shi Cc: Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Christoph Lameter, Matthew Wilcox, Zi Yan On 09/04/2024 11:47, Ryan Roberts wrote: > Thanks for the CC, Zi! I must admit I'm not great at following the list... > > > On 08/04/2024 19:56, Zi Yan wrote: >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: >> >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: >>>> >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM? >>>>> What open questions etc is it raising? > > I'm happy to see others looking at mTHP, and would be very keen to be involved > in any discussion. Unfortunately I won't be able to make it to LSFMM this year - > my wife is expecting a baby the same week. I'll register for online, but even > joining that is looking unlikely. [...] Hi Yang Shi, I finally got around to watching the video of your presentation; thanks for doing the work to benchmark this on your system. I just wanted to raise a couple of points, first on your results and secondly on your conclusions... Results ======= As I'm sure you have seen, I've done some benchmarking with mTHP and contpte, also on an Ampere Altra system. Although my system has 2 NUMA nodes (80 CPUs per node), I've deliberately disabled one of the nodes to avoid noise from cross-socket IO. So the HW should look and behave approximately the same as yours. We have one overlapping benchmark - kernel compilation - and our results are not a million miles apart. You can see my results for 4KPS at [1] (and you can take 16KPS and 64KPS results for reference from [2]). page size | Ryan | Yang Shi ------------|--------|--------- 16K (4KPS) | -6.1% | -5% 16K (16KPS) | -9.2% | -15% 64K (64KPS) | -11.4% | -16% For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested. I'm seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more. I'm not sure why these results diverge so much, perhaps you have an idea? From my side, I've run these benchmarks many many times with successive kernels and revised patches etc, and the numbers are always similar for me. I repeat multiple times across multiple reboots and also disable kaslr and (user) aslr to avoid any unwanted noise/skew. The actual test is essentially: $ make defconfig && time make -s -j80 Image I'd also be interested in how you are measuring memory. I've measured both peak and mean memory (by putting the workload in a cgroup) and see almost double the memory increase that you report for 16KPS. Our measurements for other configs match. But I also want to raise a more general point; We are not done with the optimizations yet. contpte can also improve performance for iTLB, but this requires a change to the page cache to store text in (at least) 64K folios. Typically the iTLB is under a lot of pressure and this can help reduce it. This change is not in mainline yet (and I still need to figure out how to make the patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this will also move the needle on the other benchmarks you ran.
See [3] - I'd appreciate any thoughts you have on how to get something like this accepted. [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ Conclusions =========== I think people in the room already said most of what I want to say; unfortunately there is a trade-off between performance and memory consumption. And it is not always practical to dole out the biggest THP we can allocate; lots of partially used 2M chunks would lead to a lot of wasted memory. So we need a way to let user space configure the kernel for their desired mTHP sizes. In the long term, it would be great to support an "auto" mode, and the current interfaces leave the door open to that. Perhaps your suggestion to start out with 64K and collapse to higher orders is one tool that could take us in that direction. But 64K is arm64-specific. AMD wants 32K. So you still need some mechanism to determine that (and the community wasn't keen on having the arch tell us that). It may actually turn out that we need a more complex interface to allow a (set of) mTHP order(s) to be enabled for a specific VMA. We previously concluded that if/when the time comes, then process_madvise() should give us what we need. That would allow better integration with user space. Your suggestion about splitting higher orders to 64K at swap out is interesting; that might help with some swap fragmentation issues we are currently grappling with. But ultimately splitting a folio is expensive and we want to avoid that cost as much as possible. I'd prefer to continue down the route that Chris Li is taking us down so that we can do a better job of allocating swap in the first place. Thanks, Ryan ^ permalink raw reply [flat|nested] 16+ messages in thread
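For reference, the per-VMA opt-in that exists today is MADV_HUGEPAGE (with process_madvise() as the counterpart for acting on another process); a per-VMA size or order selection as discussed above would be an extension of this, not something the current API provides. A minimal sketch of the existing interface:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL * 1024 * 1024;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Mark the VMA as THP-eligible; with a per-size sysfs knob set to
	 * "madvise", only regions like this become candidates for that size. */
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(buf, 0, len);	/* fault the range in */
	munmap(buf, len);
	return 0;
}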
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-06-25 11:12 ` Ryan Roberts @ 2024-06-25 18:11 ` Christoph Lameter (Ampere) 2024-06-26 10:47 ` Ryan Roberts 2024-06-27 20:54 ` Yang Shi 1 sibling, 1 reply; 16+ messages in thread From: Christoph Lameter (Ampere) @ 2024-06-25 18:11 UTC (permalink / raw) To: Ryan Roberts Cc: Yang Shi, Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Matthew Wilcox, Zi Yan On Tue, 25 Jun 2024, Ryan Roberts wrote: > But I also want to raise a more general point; We are not done with the > optimizations yet. contpte can also improve performance for iTLB, but this > requires a change to the page cache to store text in (at least) 64K folios. > Typically the iTLB is under a lot of pressure and this can help reduce it. This > change is not in mainline yet (and I still need to figure out how to make the > patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this > will also move the needle on the other benchmarks you ran. See [3] - I'd > appreciate any thoughts you have on how to get something like this accepted. > > [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ The discussion here seems to indicate that readahead is already ok for order-2 (16K mTHP size?). So this is only for 64K mTHP on 4K? From what I read in the ARM64 manuals it seems that CONT_PTE can only be used for 64K mTHP on 4K kernels. The 16K case will not benefit from CONT_PTE nor any other intermediate size than 64K. Quoting: https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-/Memory-region-attributes/Long-descriptor-format-memory-region-attributes?lang=en#BEIIBEIJ "Contiguous hint The Long-descriptor translation table format descriptors contain a Contiguous hint bit. Setting this bit to 1 indicates that 16 adjacent translation table entries point to a contiguous output address range. These 16 entries must be aligned in the translation table so that the top 5 bits of their input addresses, that index their position in the translation table, are the same. For example, referring to Figure 12.21, to use this hint for a block of 16 entries in the third-level translation table, bits[20:16] of the input addresses for the 16 entries must be the same. The contiguous output address range must be aligned to size of 16 translation table entries at the same translation table level. Use of this hint means that the TLB can cache a single entry to cover the 16 translation table entries. This bit is only a hint bit. The architecture does not require a processor to cache TLB entries in this way. To avoid TLB coherency issues, any TLB maintenance by address must not assume any optimization of the TLB tables that might result from use of the hint bit. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-06-25 18:11 ` Christoph Lameter (Ampere) @ 2024-06-26 10:47 ` Ryan Roberts 0 siblings, 0 replies; 16+ messages in thread From: Ryan Roberts @ 2024-06-26 10:47 UTC (permalink / raw) To: Christoph Lameter (Ampere) Cc: Yang Shi, Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Matthew Wilcox, Zi Yan On 25/06/2024 19:11, Christoph Lameter (Ampere) wrote: > On Tue, 25 Jun 2024, Ryan Roberts wrote: > >> But I also want to raise a more general point; We are not done with the >> optimizations yet. contpte can also improve performance for iTLB, but this >> requires a change to the page cache to store text in (at least) 64K folios. >> Typically the iTLB is under a lot of pressure and this can help reduce it. This >> change is not in mainline yet (and I still need to figure out how to make the >> patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this >> will also move the needle on the other benchmarks you ran. See [3] - I'd >> appreciate any thoughts you have on how to get something like this accepted. >> >> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ > > The discussion here seems to indicate that readahead is already ok for order-2 > (16K mTHP size?). So this is only for 64K mTHP on 4K? Kind of; for fiflesystems that report support for large folios, readahead starts with order-2 folio, then increments the folio order by 2 orders for every subsequent readahead marker that is hit. But text is rarely accessed sequentially so readahead markers are rarely hit in practice and therefore all the text folios tend to end up as order-2 (16K for 4K base pages). But the important bit is that the filesystem needs to support large folios in the first place, without that, we are always stuck using small (order-0) folios. XFS and a few other (network) filesystems support large folios today, but ext4 doesn't - that's being worked on though. > > From what I read in the ARM64 manuals it seems that CONT_PTE can only be used > for 64K mTHP on 4K kernels. The 16K case will not benefit from CONT_PTE nor any > other intermediate size than 64K. Yes and no. The contiguous hint, when applied, constitutes a single fixed size and that size depends on the base page size. Its 64K for 4KPS, 2M for 16KPS and 2M for 64KPS. However, most modern Arm-designed CPUs support a micro-architectural feature called Hardware Page Aggregation (HPA), which can aggregate up to 4 pages into a single TLB in a way that is transparent to SW. So that feature can benefit from 16K folios when using 4K base pages. Although HPA is implemented in the Neoverse N1 CPU (which is what I believe is in the Ampere Altra), it is disabled and due to an errata can't be enabled. So HPA is not relevant for Altra. > > Quoting: > > https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-/Memory-region-attributes/Long-descriptor-format-memory-region-attributes?lang=en#BEIIBEIJ Note this link is for armv7A, not v8. But hopefully my explanation about answers everything. Thanks, Ryan > > "Contiguous hint > > The Long-descriptor translation table format descriptors contain a Contiguous > hint bit. Setting this bit to 1 indicates that 16 adjacent translation table > entries point to a contiguous output address range. 
These 16 entries must be > aligned in the translation table so that the top 5 bits of their input > addresses, that index their position in the translation table, are the same. For > example, referring to Figure 12.21, to use this hint for a block of 16 entries > in the third-level translation table, bits[20:16] of the input addresses for the > 16 entries must be the same. > > The contiguous output address range must be aligned to size of 16 translation > table entries at the same translation table level. > > Use of this hint means that the TLB can cache a single entry to cover the 16 > translation table entries. > > This bit is only a hint bit. The architecture does not require a processor to > cache TLB entries in this way. To avoid TLB coherency issues, any TLB > maintenance by address must not assume any optimization of the TLB tables that > might result from use of the hint bit. > ^ permalink raw reply [flat|nested] 16+ messages in thread
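To summarise the geometry discussed above (values assumed from the arm64 CONT_PTE definitions; one contiguous-hint entry spans a different block size per base page size):

#include <stdio.h>

int main(void)
{
	/* Number of PTEs covered by one contiguous-hint entry and the resulting
	 * block size, per base page size (assumed to match CONT_PTES/CONT_PTE_SIZE
	 * in the arm64 headers). */
	static const struct { const char *granule; int ptes; const char *block; } g[] = {
		{ "4K",  16,  "64K" },	/* 16 * 4K   = 64K */
		{ "16K", 128, "2M"  },	/* 128 * 16K = 2M  */
		{ "64K", 32,  "2M"  },	/* 32 * 64K  = 2M  */
	};

	for (int i = 0; i < 3; i++)
		printf("%3s base pages: %3d contiguous PTEs -> one %s TLB entry\n",
		       g[i].granule, g[i].ptes, g[i].block);
	return 0;
}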
* Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 2024-06-25 11:12 ` Ryan Roberts 2024-06-25 18:11 ` Christoph Lameter (Ampere) @ 2024-06-27 20:54 ` Yang Shi 1 sibling, 0 replies; 16+ messages in thread From: Yang Shi @ 2024-06-27 20:54 UTC (permalink / raw) To: Ryan Roberts Cc: Jonathan Cameron, lsf-pc, olivier.singla, Linux MM, Michal Hocko, Dan Williams, Christoph Lameter, Matthew Wilcox, Zi Yan On Tue, Jun 25, 2024 at 4:12 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 09/04/2024 11:47, Ryan Roberts wrote: > > Thanks for the CC, Zi! I must admit I'm not great at following the list... > > > > > > On 08/04/2024 19:56, Zi Yan wrote: > >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote: > >> > >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote: > >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote: > >>>> > >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM? > >>>>> What open questions etc is it raising? > > > > I'm happy to see others looking at mTHP, and would be very keen to be involved > > in any discussion. Unfortunately I won't be able to make it to LSFMM this year - > > my wife is expecting a baby the same week. I'll register for online, but even > > joining that is looking unlikely. > > [...] > > Hi Yang Shi, > > I finally got around to watching the video of your presentation; Thanks for > doing the work to benchmark this on your system. > > I just wanted to raise a couple of points, first on your results and secondly on > your conclusions... Thanks for following up. Sorry for the late reply, I just came back from a 2 week vacation and still suffered from jet lag... > > Results > ======= > > As I'm sure you have seen, I've done some benchmarking with mTHP and contpte, > also on an Ampere Altra system. Although my system has 2 NUMA nodes (80 CPUs per > node), I've deliberately disabled one of the nodes to avoid noise from cross > socket IO. So the HW should look and behave approximately the same as yours. I used 1 socket system, but 128 cores per node. I used taskset to bind kernel build tasks on core 10 - 89. > > We have one overlapping benchmark - kernel compilation - and our results are not > a million miles apart. You can see my results for 4KPS at [1] (and you can take > 16KPS and 64KPS results for reference from [2]). > > page size | Ryan | Yang Shi > ------------|--------|--------- > 16K (4KPS) | -6.1% | -5% > 16K (16KPS) | -9.2% | -15% > 64K (64KPS) | -11.4% | -16% > > For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested. I'm > seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more. I'm not > sure why these results diverge so much, perhaps you have an idea? From my side, > I've run these benchmarks many many times with successive kernels and revised > patches etc, and the numbers are always similar for me. I repeat multiple times > across multiple reboots and also disable kaslr and (user) aslr to avoid any > unwanted noise/skew. > > The actual test is essentially: > > $ make defconfig && time make –s –j80 Image I'm not sure whether the config may make some difference or not. I used the default Fedora config. And I'm running my test on Fedora 39 with gcc (GCC) 13.2.1 20230918. I saw you were using ubuntu 22.04. Not sure whether this is correlated or not. And Matthew said he didn't see any number close to our number (I can't remember what exactly he said, but he should mean it) in the discussion. I'm not sure what number Matthew meant, or he meant your number? 
> > I'd also be interested in how you are measuring memory. I've measured both peak > and mean memory (by putting the workload in a cgroup) and see almost double the > memory increase that you report for 16KPS. Our measurements for other configs match. I also used memory.peak to measure the memory consumption. I didn't try different configs. I just noticed more cores may incur more memory consumption. It is more noticeable with 64KPS. > > But I also want to raise a more general point; We are not done with the > optimizations yet. contpte can also improve performance for iTLB, but this > requires a change to the page cache to store text in (at least) 64K folios. > Typically the iTLB is under a lot of pressure and this can help reduce it. This > change is not in mainline yet (and I still need to figure out how to make the > patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this > will also move the needle on the other benchmarks you ran. See [3] - I'd > appreciate any thoughts you have on how to get something like this accepted. AFAIK, the improvement from reduced iTLB really depends on workloads. IIRC, MySQL is more sensitive to it. We did some tests with CONFIG_READ_ONLY_THP_FOR_FS enabled for MySQL, we saw decent improvement, but I really don't remember the exact number. > > [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ > [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/ > > Conclusions > =========== > > I think people in the room already said most of what I want to say; > Unfortunately there is a trade-off between performance and memory consumption. > And it is not always practical to dole out the biggest THP we can allocate; lots > of partially used 2M chunks would lead to a lot of wasted memory. So we need a > way to let user space configure the kernel for their desired mTHP sizes. > > In the long term, it would be great to support an "auto" mode, and the current > interfaces leave the door open to that. Perhaps your suggestion to start out > with 64K and collapse to higher orders is one tool that could take us in that > direction. But 64K is arm64-specific. AMD wants 32K. So you still need some > mechanism to determine that (and the community wasn't keen on having the arch > tell us that). > > It may actually turn out that we need a more complex interface to allow a (set > of) mTHP order(s) to be enabled for a specific VMA. We previously concluded that > if/when the time comes, then madvise_process() should give us what we need. That > would allow better integration with user space. The internal fragmentation or memory waste for 2M THP is a chronic problem. The medium sized THP can help tackle this, but the performance may not be as good as 2M THP. So after the discussion I was actually thinking that we may need two policies based on the workloads since there seems to be no one policy that works for everyone. One for max TLB utilization improvement, the other for memory conservative. For example, the workload which doesn't care too much about memory waste, they can choose to allocate THP from the biggest suitable order, for example, 2M for some VM workloads. On the other side of the spectrum, we can start allocating from smaller order then collapse to larger order. The system can have a default policy, the users can change the policy by calling some interfaces, for example, madvise(). 
Anyway, just off the top of my head, I haven't invested too much time in this aspect yet. I don't think 64K vs 32K is a problem. The two 32K chunks in the same 64K chunk are properly aligned. 64K is not a very high order, so starting from 64K for everyone should not be a problem. I don't see why we have to care about this. With all the means mentioned above, we may be able to achieve a full "auto" mode in the future. Actually, another problem with the current interface is that we may end up having the same behavior with different settings. For example, having "inherit" for all orders and "always" for the top-level knob may behave the same as having all orders and the top-level knob set to "always". This may result in some confusion and violate the rule for sysfs interfaces. > > Your suggestion about splitting higher orders to 64K at swap out is interesting; > that might help with some swap fragmentation issues we are currently grappling > with. But ultimately splitting a folio is expensive and we want to avoid that > cost as much as possible. I'd prefer to continue down the route that Chris Li is > taking us so that we can do a better job of allocating swap in the first place. I think I meant splitting to 64K when we have to split. I don't mean we split to 64K all the time. If we run into swap fragmentation, splitting to a smaller order may help reduce premature OOMs, and the cost of splitting may be worth it. Just like what we do for other paths, for example, page demotion, migration, etc., we split the large folio if there is not enough memory. I may not have articulated this clearly in the slides and the discussion, sorry for the confusion. If we have a better way to tackle the swap fragmentation without splitting, that is definitely preferable. > > Thanks, > Ryan > ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-06-27 20:54 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-03-28 16:47 [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64 Yang Shi 2024-04-01 18:16 ` Jonathan Cameron 2024-04-02 20:04 ` Yang Shi 2024-04-04 18:57 ` Christoph Lameter (Ampere) 2024-04-04 19:33 ` David Hildenbrand 2024-04-09 18:41 ` Yang Shi 2024-04-09 18:44 ` David Hildenbrand 2024-04-30 14:41 ` Michal Hocko 2024-05-01 16:37 ` Yang Shi 2024-04-08 16:30 ` Matthew Wilcox 2024-04-08 18:56 ` Zi Yan 2024-04-09 10:47 ` Ryan Roberts 2024-06-25 11:12 ` Ryan Roberts 2024-06-25 18:11 ` Christoph Lameter (Ampere) 2024-06-26 10:47 ` Ryan Roberts 2024-06-27 20:54 ` Yang Shi