* [RFC 0/2] mm: introduce THP deferred setting
@ 2024-07-29 22:27 Nico Pache
2024-07-29 22:27 ` [RFC 1/2] mm: defer THP insertion to khugepaged Nico Pache
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Nico Pache @ 2024-07-29 22:27 UTC (permalink / raw)
To: linux-kernel, linux-doc, linux-mm
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Barry Song,
Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Zi Yan,
Rafael Aquini, Andrea Arcangeli, Jonathan Corbet
We've seen cases where customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.
Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.
For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set transparent_hugepage/enabled=never. This is in
part due to performance degradation and increased memory waste.
This series introduces enabled=defer, a setting that acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the allocation is not MADV_HUGEPAGE, then the page fault handler will
default to the base size allocation. The caveat is that khugepaged can
still operate on pages that are not MADV_HUGEPAGE.
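To make the semantics concrete, here is a minimal userspace sketch (an
illustration, not part of this series): an application opts in with
madvise(MADV_HUGEPAGE), so even under enabled=defer it remains eligible
for a THP at fault time. The 2MiB size is an assumption and alignment
handling is omitted for brevity.

#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2UL << 20)

int main(void)
{
	/* One PMD-sized anonymous mapping (alignment glossed over). */
	void *buf = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * Under enabled=defer only MADV_HUGEPAGE mappings are eligible
	 * for a THP in the page fault handler; everything else waits
	 * for khugepaged.
	 */
	if (madvise(buf, SZ_2M, MADV_HUGEPAGE))
		return 1;

	memset(buf, 1, SZ_2M);	/* first touch may fault in a THP */
	munmap(buf, SZ_2M);
	return 0;
}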
This allows for two things: one, applications specifically designed to
use hugepages will get them, and two, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste and defers
the use of hugepages to khugepaged, which can then scan the memory for
eligible collapse candidates.
Admins may want to lower max_ptes_none; if not, khugepaged may
aggressively collapse sparsely populated regions into hugepages.
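For example (a hypothetical tuning sketch, not from this series): on
x86-64 a 2MiB PMD is backed by 512 PTEs, so max_ptes_none=64 requires at
least 512 - 64 = 448 populated PTEs (~87.5%) before khugepaged will
collapse a region, bounding the zero-filled waste per collapse at
64 * 4KiB = 256KiB instead of up to 511 * 4KiB at the default of 511.

#include <stdio.h>

/* Knob documented in Documentation/admin-guide/mm/transhuge.rst. */
#define MAX_PTES_NONE \
	"/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"

int main(void)
{
	FILE *f = fopen(MAX_PTES_NONE, "w");	/* needs root */

	if (!f)
		return 1;
	/* Allow at most 64 empty PTEs out of 512 per collapse. */
	fprintf(f, "%d\n", 64);
	return fclose(f) ? 1 : 0;
}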
RFC note
==========
I'm not sure if I'm missing anything related to the mTHP
changes. I think that now that we have hugepage_pmd_enabled in
commit 00f58104202c ("mm: fix khugepaged activation policy"), everything
should work as expected.
Nico Pache (2):
mm: defer THP insertion to khugepaged
mm: document transparent_hugepage=defer usage
Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
include/linux/huge_mm.h | 15 +++++++++--
mm/huge_memory.c | 31 +++++++++++++++++++---
3 files changed, 55 insertions(+), 9 deletions(-)
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
--
2.45.2
^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC 1/2] mm: defer THP insertion to khugepaged
2024-07-29 22:27 [RFC 0/2] mm: introduce THP deferred setting Nico Pache
@ 2024-07-29 22:27 ` Nico Pache
2024-07-29 22:27 ` [RFC 2/2] mm: document transparent_hugepage=defer usage Nico Pache
2024-07-30  1:26 ` [RFC 0/2] mm: introduce THP deferred setting Zi Yan
2 siblings, 0 replies; 19+ messages in thread
From: Nico Pache @ 2024-07-29 22:27 UTC (permalink / raw)
To: linux-kernel, linux-doc, linux-mm
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Barry Song,
Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Zi Yan,
Rafael Aquini, Andrea Arcangeli, Jonathan Corbet

Setting transparent_hugepage/enabled=always allows applications to
benefit from THPs without having to madvise. However, the page fault
handler takes very few considerations into account when deciding
whether or not to actually use a THP, which can lead to a lot of wasted
memory. khugepaged only operates on memory that was either allocated
with enabled=always or marked MADV_HUGEPAGE.

Introduce the ability to set enabled=defer, which prevents THPs from
being allocated by the page fault handler unless madvise is set,
leaving it up to khugepaged to decide which allocations will collapse
to a THP. This should allow applications to benefit from THPs while
curbing some of the memory waste.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/huge_mm.h | 15 +++++++++++++--
 mm/huge_memory.c        | 31 +++++++++++++++++++++++++++----
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..7ef7d3e2f5c6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -45,6 +45,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_UNSUPPORTED,
 	TRANSPARENT_HUGEPAGE_FLAG,
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
@@ -125,6 +126,7 @@ static inline bool hugepage_global_enabled(void)
 {
 	return transparent_hugepage_flags &
 			((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+			 (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG) |
 			 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
 }
 
@@ -134,6 +136,12 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
 }
 
+static inline bool hugepage_global_defer(void)
+{
+	return transparent_hugepage_flags &
+			(1<<TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG);
+}
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
@@ -243,13 +251,16 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long tva_flags,
 					 unsigned long orders)
 {
+	if ((tva_flags & TVA_IN_PF) && hugepage_global_defer() &&
+	    !(vm_flags & VM_HUGEPAGE))
+		return 0;
+
 	/* Optimization to check if required orders are enabled early. */
 	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
-		if (hugepage_global_always() ||
+		if (hugepage_global_always() || hugepage_global_defer() ||
 		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
 			mask |= READ_ONCE(huge_anon_orders_inherit);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f4be468e06a4..a2975ad2cd91 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -294,12 +294,15 @@ static ssize_t enabled_show(struct kobject *kobj,
 	const char *output;
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
-		output = "[always] madvise never";
+		output = "[always] madvise defer never";
 	else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			  &transparent_hugepage_flags))
-		output = "always [madvise] never";
+		output = "always [madvise] defer never";
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+			  &transparent_hugepage_flags))
+		output = "always madvise [defer] never";
 	else
-		output = "always madvise [never]";
+		output = "always madvise defer [never]";
 
 	return sysfs_emit(buf, "%s\n", output);
 }
@@ -312,13 +315,20 @@ static ssize_t enabled_store(struct kobject *kobj,
 	if (sysfs_streq(buf, "always")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (sysfs_streq(buf, "defer")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
 	} else if (sysfs_streq(buf, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
 	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG, &transparent_hugepage_flags);
 	} else
 		ret = -EINVAL;
 
@@ -817,18 +827,31 @@ static int __init setup_transparent_hugepage(char *str)
 			&transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			&transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+			&transparent_hugepage_flags);
 		ret = 1;
+	} else if (!strcmp(str, "defer")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			&transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			&transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+			&transparent_hugepage_flags);
 	} else if (!strcmp(str, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
 			&transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+			&transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
-			&transparent_hugepage_flags);
+			&transparent_hugepage_flags);
 		ret = 1;
 	} else if (!strcmp(str, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
 			&transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			&transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_INST_FLAG,
+			&transparent_hugepage_flags);
 		ret = 1;
 	}
 out:
--
2.45.2

^ permalink raw reply	[flat|nested] 19+ messages in thread
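One hedged way to observe the behavior this patch introduces (an
illustrative test, not part of the series): fault in an anonymous
region without MADV_HUGEPAGE and sum AnonHugePages from
/proc/self/smaps; under enabled=defer the total should stay at 0 kB
until khugepaged collapses the region. The 16MiB size is an arbitrary
assumption.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_16M (16UL << 20)

int main(void)
{
	char line[256];
	long kb = 0, tmp;
	FILE *f;
	void *buf = mmap(NULL, SZ_16M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, SZ_16M);	/* no MADV_HUGEPAGE: base pages expected */

	f = fopen("/proc/self/smaps", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "AnonHugePages: %ld kB", &tmp) == 1)
			kb += tmp;
	fclose(f);

	/* Under enabled=defer this should print 0 right after faulting. */
	printf("AnonHugePages: %ld kB\n", kb);
	return 0;
}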
* [RFC 2/2] mm: document transparent_hugepage=defer usage
2024-07-29 22:27 [RFC 0/2] mm: introduce THP deferred setting Nico Pache
2024-07-29 22:27 ` [RFC 1/2] mm: defer THP insertion to khugepaged Nico Pache
@ 2024-07-29 22:27 ` Nico Pache
2024-07-30  1:26 ` [RFC 0/2] mm: introduce THP deferred setting Zi Yan
2 siblings, 0 replies; 19+ messages in thread
From: Nico Pache @ 2024-07-29 22:27 UTC (permalink / raw)
To: linux-kernel, linux-doc, linux-mm
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Barry Song,
Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Zi Yan,
Rafael Aquini, Andrea Arcangeli, Jonathan Corbet

The new transparent_hugepage=defer option allows for a more
conservative approach to THPs. Document its usage in the transhuge
admin-guide.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..1946fbb789b2 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -88,8 +88,9 @@ In certain cases when hugepages are enabled system wide, application
 may end up allocating more memory resources. An application may mmap a
 large region but only touch 1 byte of it, in that case a 2M page might
 be allocated instead of a 4k page for no good. This is why it's
-possible to disable hugepages system-wide and to only have them inside
-MADV_HUGEPAGE madvise regions.
+possible to disable hugepages system-wide, only have them inside
+MADV_HUGEPAGE madvise regions, or defer them away from the page fault
+handler to khugepaged.
 
 Embedded systems should enable hugepages only inside madvise regions
 to eliminate any risk of wasting any precious byte of memory and to
@@ -99,6 +100,15 @@ Applications that gets a lot of benefit from hugepages and that don't
 risk to lose memory by using hugepages, should use
 madvise(MADV_HUGEPAGE) on their critical mmapped regions.
 
+Applications that would like to benefit from THPs but would still like a
+more memory conservative approach can choose 'defer'. This avoids
+inserting THPs at the page fault handler unless they are MADV_HUGEPAGE.
+Khugepaged will then scan the mappings for potential collapses into PMD
+sized pages. Admins using the 'defer' setting should consider
+tweaking khugepaged/max_ptes_none. The current default of 511 may
+aggressively collapse your PTEs into PMDs. Lower this value to conserve
+more memory (i.e. max_ptes_none=64).
+
 .. _thp_sysfs:
 
 sysfs
@@ -136,6 +146,7 @@ The top-level setting (for use with "inherit") can be set by issuing
 one of the following commands::
 
 	echo always >/sys/kernel/mm/transparent_hugepage/enabled
+	echo defer >/sys/kernel/mm/transparent_hugepage/enabled
 	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 	echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
@@ -264,7 +275,8 @@ of small pages into one large page::
 
 A higher value leads to use additional memory for programs.
 A lower value leads to gain less thp performance. Value of
 max_ptes_none can waste cpu time very little, you can
-ignore it.
+ignore it. Consider lowering this value when using
+``transparent_hugepage=defer``.
 
 ``max_ptes_swap`` specifies how many pages can be brought in from
 swap when collapsing a group of pages into a transparent huge page::
--
2.45.2

^ permalink raw reply	[flat|nested] 19+ messages in thread
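Tying the two patches together, a small hedged sketch (illustrative
only, not from the series) that parses the active mode out of the
bracketed token in the enabled file, which with this series can now
read "always madvise [defer] never":

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[128], *l, *r;
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

	if (!f)
		return 1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* The active mode is the bracketed token, e.g. "[defer]". */
	l = strchr(buf, '[');
	r = l ? strchr(l, ']') : NULL;
	if (!r)
		return 1;
	*r = '\0';
	printf("active THP mode: %s\n", l + 1);
	return 0;
}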
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-07-29 22:27 [RFC 0/2] mm: introduce THP deferred setting Nico Pache
2024-07-29 22:27 ` [RFC 1/2] mm: defer THP insertion to khugepaged Nico Pache
2024-07-29 22:27 ` [RFC 2/2] mm: document transparent_hugepage=defer usage Nico Pache
@ 2024-07-30  1:26 ` Zi Yan
2024-07-30 22:37   ` Nico Pache
2 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2024-07-30  1:26 UTC (permalink / raw)
To: Nico Pache
Cc: linux-kernel, linux-doc, linux-mm, Andrew Morton,
David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts,
Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli,
Jonathan Corbet, Kirill A . Shutemov

[-- Attachment #1: Type: text/plain, Size: 3104 bytes --]

+Kirill

On 29 Jul 2024, at 18:27, Nico Pache wrote:

> We've seen cases where customers switching from RHEL7 to RHEL8 see a significant increase in the memory footprint for the same workloads.
>
> Through our investigations we found that a large contributing factor to the increase in RSS was an increase in THP usage.

Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?

> For workloads like MySQL, or when using allocators like jemalloc, it is often recommended to set transparent_hugepage/enabled=never. This is in part due to performance degradation and increased memory waste.
>
> This series introduces enabled=defer, a setting that acts as a middle ground between always and madvise. If the mapping is MADV_HUGEPAGE, the page fault handler will act normally, making a hugepage if possible. If the allocation is not MADV_HUGEPAGE, then the page fault handler will default to the base size allocation. The caveat is that khugepaged can still operate on pages that are not MADV_HUGEPAGE.

Why? If the user does not explicitly want huge pages, why bother providing them? Wouldn't it increase the memory footprint?

> This allows for two things: one, applications specifically designed to use hugepages will get them, and two, applications that don't use hugepages can still benefit from them without aggressively inserting THPs at every possible chance. This curbs the memory waste and defers the use of hugepages to khugepaged, which can then scan the memory for eligible collapse candidates.

khugepaged would replace application memory with huge pages without a specific goal. Why not use a user space agent with process_madvise() to collapse huge pages? Admins might have more knobs to tweak than khugepaged.

> Admins may want to lower max_ptes_none; if not, khugepaged may aggressively collapse sparsely populated regions into hugepages.
>
> RFC note
> ==========
> I'm not sure if I'm missing anything related to the mTHP changes. I think that now that we have hugepage_pmd_enabled in commit 00f58104202c ("mm: fix khugepaged activation policy"), everything should work as expected.
>
> Nico Pache (2):
>   mm: defer THP insertion to khugepaged
>   mm: document transparent_hugepage=defer usage
>
>  Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
>  include/linux/huge_mm.h                    | 15 +++++++++--
>  mm/huge_memory.c                           | 31 +++++++++++++++++++---
>  3 files changed, 55 insertions(+), 9 deletions(-)
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Rafael Aquini <aquini@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> --
> 2.45.2

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread
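For reference, the userspace-agent alternative Zi Yan suggests could
look roughly like the sketch below (an assumption-laden illustration,
not from the thread): process_madvise() with MADV_COLLAPSE, available
since kernel 6.1, invoked via raw syscalls in case libc lacks wrappers;
the target pid, address, and length are placeholders.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: matches the uapi value */
#endif

int main(void)
{
	/* Assumed target: pid 1234, one 2MiB region at an example address. */
	int pidfd = syscall(SYS_pidfd_open, 1234, 0);
	struct iovec iov = {
		.iov_base = (void *)0x7f0000000000UL,
		.iov_len = 2UL << 20,
	};

	if (pidfd < 0)
		return 1;
	if (syscall(SYS_process_madvise, pidfd, &iov, 1UL,
		    MADV_COLLAPSE, 0U) < 0) {
		perror("process_madvise");
		return 1;
	}
	return 0;
}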
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-07-30  1:26 ` [RFC 0/2] mm: introduce THP deferred setting Zi Yan
@ 2024-07-30 22:37 ` Nico Pache
2024-08-26 15:40   ` Nico Pache
0 siblings, 1 reply; 19+ messages in thread
From: Nico Pache @ 2024-07-30 22:37 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, linux-doc, linux-mm, Andrew Morton,
David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts,
Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli,
Jonathan Corbet, Kirill A . Shutemov

Hi Zi Yan,

On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote:
>
> +Kirill
>
> On 29 Jul 2024, at 18:27, Nico Pache wrote:
>
> > We've seen cases where customers switching from RHEL7 to RHEL8 see a significant increase in the memory footprint for the same workloads.
> >
> > Through our investigations we found that a large contributing factor to the increase in RSS was an increase in THP usage.
>
> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?

IIRC, most of the systems tuning is the same. We attributed the
increase in THP usage to a combination of improvements in the kernel
and improvements in the libraries (better alignments). That allowed
THP allocations to succeed at a higher rate. I can go back and confirm
this tomorrow though.

> > For workloads like MySQL, or when using allocators like jemalloc, it is often recommended to set transparent_hugepage/enabled=never. This is in part due to performance degradation and increased memory waste.
> >
> > This series introduces enabled=defer, a setting that acts as a middle ground between always and madvise. If the mapping is MADV_HUGEPAGE, the page fault handler will act normally, making a hugepage if possible. If the allocation is not MADV_HUGEPAGE, then the page fault handler will default to the base size allocation. The caveat is that khugepaged can still operate on pages that are not MADV_HUGEPAGE.
>
> Why? If the user does not explicitly want huge pages, why bother providing them? Wouldn't it increase the memory footprint?

So we have "always", which will always try to allocate a THP when it
can. This setting gives good performance in a lot of conditions, but
tends to waste memory. Additionally, applications DON'T need to be
modified to take advantage of THPs.

We have "madvise", which will only satisfy allocations that are
MADV_HUGEPAGE; this gives you granular control, and a lot of times
these madvises come from libraries. Unlike "always", you DO need to
modify your application if you want to use THPs.

Then we have "never", which of course never allocates THPs.

OK, back to your question: like "madvise", "defer" gives you the
benefits of THPs when you specifically know you want them
(MADV_HUGEPAGE), but it also benefits applications that don't
specifically ask for them (or can't be modified to ask for them), like
"always" does. The applications that don't ask for THPs must wait for
khugepaged to get them (avoiding insertions at PF time) -- this curbs a
lot of memory waste and gives increased tunability over "always".
Another added benefit is that khugepaged will most likely not operate
on short-lived allocations, meaning that only longstanding memory will
be collapsed to THPs.

The memory waste can be tuned with max_ptes_none... let's say you want
~90% of your PMD to be full before collapsing into a huge page: simply
set max_ptes_none=64. Or, for no waste, set max_ptes_none=0, requiring
all 512 pages to be present before being collapsed.

> > This allows for two things: one, applications specifically designed to use hugepages will get them, and two, applications that don't use hugepages can still benefit from them without aggressively inserting THPs at every possible chance. This curbs the memory waste and defers the use of hugepages to khugepaged, which can then scan the memory for eligible collapse candidates.
>
> khugepaged would replace application memory with huge pages without a specific goal. Why not use a user space agent with process_madvise() to collapse huge pages? Admins might have more knobs to tweak than khugepaged.

The benefits of "always" are that no userspace agent is needed, and
applications don't have to be modified to use madvise(MADV_HUGEPAGE) to
benefit from THPs. This setting hopes to gain some of the same benefits
without the significant waste of memory, and with increased tunability.

Future changes I have in the works are to make khugepaged more "smart",
moving it away from the round-robin fashion it currently operates in,
to instead make smart and informed decisions about what memory to
collapse (and potentially split).

Hopefully that helped explain the motivation for this new setting!

Cheers!
-- Nico

> > Admins may want to lower max_ptes_none; if not, khugepaged may aggressively collapse sparsely populated regions into hugepages.
> >
> > RFC note
> > ==========
> > I'm not sure if I'm missing anything related to the mTHP changes. I think that now that we have hugepage_pmd_enabled in commit 00f58104202c ("mm: fix khugepaged activation policy"), everything should work as expected.
> >
> > Nico Pache (2):
> >   mm: defer THP insertion to khugepaged
> >   mm: document transparent_hugepage=defer usage
> >
> >  Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
> >  include/linux/huge_mm.h                    | 15 +++++++++--
> >  mm/huge_memory.c                           | 31 +++++++++++++++++++---
> >  3 files changed, 55 insertions(+), 9 deletions(-)
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Barry Song <baohua@kernel.org>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Lance Yang <ioworker0@gmail.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Rafael Aquini <aquini@redhat.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > --
> > 2.45.2
>
> --
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-07-30 22:37 ` [RFC 0/2] mm: introduce THP deferred setting Nico Pache
@ 2024-08-26 15:40 ` Nico Pache
2024-08-26 16:47   ` Usama Arif
0 siblings, 1 reply; 19+ messages in thread
From: Nico Pache @ 2024-08-26 15:40 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: linux-doc, Andrew Morton, David Hildenbrand, Matthew Wilcox,
Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu,
Rafael Aquini, Andrea Arcangeli, Jonathan Corbet,
Kirill A . Shutemov, Zi Yan, usamaarif642

On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@redhat.com> wrote:
>
> Hi Zi Yan,
>
> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote:
> >
> > +Kirill
> >
> > On 29 Jul 2024, at 18:27, Nico Pache wrote:
> >
> > > We've seen cases where customers switching from RHEL7 to RHEL8 see a significant increase in the memory footprint for the same workloads.
> > >
> > > Through our investigations we found that a large contributing factor to the increase in RSS was an increase in THP usage.
> >
> > Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
>
> IIRC, most of the systems tuning is the same. We attributed the increase in THP usage to a combination of improvements in the kernel and improvements in the libraries (better alignments). That allowed THP allocations to succeed at a higher rate. I can go back and confirm this tomorrow though.
>
> > > For workloads like MySQL, or when using allocators like jemalloc, it is often recommended to set transparent_hugepage/enabled=never. This is in part due to performance degradation and increased memory waste.
> > >
> > > This series introduces enabled=defer, a setting that acts as a middle ground between always and madvise. If the mapping is MADV_HUGEPAGE, the page fault handler will act normally, making a hugepage if possible. If the allocation is not MADV_HUGEPAGE, then the page fault handler will default to the base size allocation. The caveat is that khugepaged can still operate on pages that are not MADV_HUGEPAGE.
> >
> > Why? If the user does not explicitly want huge pages, why bother providing them? Wouldn't it increase the memory footprint?
>
> So we have "always", which will always try to allocate a THP when it can. This setting gives good performance in a lot of conditions, but tends to waste memory. Additionally, applications DON'T need to be modified to take advantage of THPs.
>
> We have "madvise", which will only satisfy allocations that are MADV_HUGEPAGE; this gives you granular control, and a lot of times these madvises come from libraries. Unlike "always", you DO need to modify your application if you want to use THPs.
>
> Then we have "never", which of course never allocates THPs.
>
> OK, back to your question: like "madvise", "defer" gives you the benefits of THPs when you specifically know you want them (MADV_HUGEPAGE), but it also benefits applications that don't specifically ask for them (or can't be modified to ask for them), like "always" does. The applications that don't ask for THPs must wait for khugepaged to get them (avoiding insertions at PF time) -- this curbs a lot of memory waste and gives increased tunability over "always". Another added benefit is that khugepaged will most likely not operate on short-lived allocations, meaning that only longstanding memory will be collapsed to THPs.
>
> The memory waste can be tuned with max_ptes_none... let's say you want ~90% of your PMD to be full before collapsing into a huge page: simply set max_ptes_none=64. Or, for no waste, set max_ptes_none=0, requiring all 512 pages to be present before being collapsed.
>
> > > This allows for two things: one, applications specifically designed to use hugepages will get them, and two, applications that don't use hugepages can still benefit from them without aggressively inserting THPs at every possible chance. This curbs the memory waste and defers the use of hugepages to khugepaged, which can then scan the memory for eligible collapse candidates.
> >
> > khugepaged would replace application memory with huge pages without a specific goal. Why not use a user space agent with process_madvise() to collapse huge pages? Admins might have more knobs to tweak than khugepaged.
>
> The benefits of "always" are that no userspace agent is needed, and applications don't have to be modified to use madvise(MADV_HUGEPAGE) to benefit from THPs. This setting hopes to gain some of the same benefits without the significant waste of memory, and with increased tunability.
>
> Future changes I have in the works are to make khugepaged more "smart", moving it away from the round-robin fashion it currently operates in, to instead make smart and informed decisions about what memory to collapse (and potentially split).
>
> Hopefully that helped explain the motivation for this new setting!

Any last comments before I resend this?

I've been made aware of
https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
which introduces THP splitting. These are both trying to achieve the
same thing through different means. Our approach leverages khugepaged
to promote pages, while Usama's uses the reclaim path to demote
hugepages and shrink the underlying memory.

I will leave it up to reviewers to determine which is better; however,
we can't have both, as we'd be introducing thrashing conditions.

Cheers,
-- Nico

> Cheers!
> -- Nico
>
> > > Admins may want to lower max_ptes_none; if not, khugepaged may aggressively collapse sparsely populated regions into hugepages.
> > >
> > > RFC note
> > > ==========
> > > I'm not sure if I'm missing anything related to the mTHP changes. I think that now that we have hugepage_pmd_enabled in commit 00f58104202c ("mm: fix khugepaged activation policy"), everything should work as expected.
> > >
> > > Nico Pache (2):
> > >   mm: defer THP insertion to khugepaged
> > >   mm: document transparent_hugepage=defer usage
> > >
> > >  Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
> > >  include/linux/huge_mm.h                    | 15 +++++++++--
> > >  mm/huge_memory.c                           | 31 +++++++++++++++++++---
> > >  3 files changed, 55 insertions(+), 9 deletions(-)
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Barry Song <baohua@kernel.org>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Cc: Lance Yang <ioworker0@gmail.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Cc: Rafael Aquini <aquini@redhat.com>
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > --
> > > 2.45.2
> >
> > --
> > Best Regards,
> > Yan, Zi

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-26 15:40 ` Nico Pache
@ 2024-08-26 16:47 ` Usama Arif
2024-08-26 21:14   ` Nico Pache
0 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2024-08-26 16:47 UTC (permalink / raw)
To: Nico Pache, linux-mm, linux-kernel
Cc: linux-doc, Andrew Morton, David Hildenbrand, Matthew Wilcox,
Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu,
Rafael Aquini, Andrea Arcangeli, Jonathan Corbet,
Kirill A . Shutemov, Zi Yan, Johannes Weiner

On 26/08/2024 11:40, Nico Pache wrote:
> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@redhat.com> wrote:
>>
>> Hi Zi Yan,
>>
>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> +Kirill
>>>
>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
>>>
>>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a significant increase in the memory footprint for the same workloads.
>>>>
>>>> Through our investigations we found that a large contributing factor to the increase in RSS was an increase in THP usage.
>>>
>>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
>>
>> IIRC, most of the systems tuning is the same. We attributed the increase in THP usage to a combination of improvements in the kernel and improvements in the libraries (better alignments). That allowed THP allocations to succeed at a higher rate. I can go back and confirm this tomorrow though.
>>
>>>> For workloads like MySQL, or when using allocators like jemalloc, it is often recommended to set transparent_hugepage/enabled=never. This is in part due to performance degradation and increased memory waste.
>>>>
>>>> This series introduces enabled=defer, a setting that acts as a middle ground between always and madvise. If the mapping is MADV_HUGEPAGE, the page fault handler will act normally, making a hugepage if possible. If the allocation is not MADV_HUGEPAGE, then the page fault handler will default to the base size allocation. The caveat is that khugepaged can still operate on pages that are not MADV_HUGEPAGE.
>>>
>>> Why? If the user does not explicitly want huge pages, why bother providing them? Wouldn't it increase the memory footprint?
>>
>> So we have "always", which will always try to allocate a THP when it can. This setting gives good performance in a lot of conditions, but tends to waste memory. Additionally, applications DON'T need to be modified to take advantage of THPs.
>>
>> We have "madvise", which will only satisfy allocations that are MADV_HUGEPAGE; this gives you granular control, and a lot of times these madvises come from libraries. Unlike "always", you DO need to modify your application if you want to use THPs.
>>
>> Then we have "never", which of course never allocates THPs.
>>
>> OK, back to your question: like "madvise", "defer" gives you the benefits of THPs when you specifically know you want them (MADV_HUGEPAGE), but it also benefits applications that don't specifically ask for them (or can't be modified to ask for them), like "always" does. The applications that don't ask for THPs must wait for khugepaged to get them (avoiding insertions at PF time) -- this curbs a lot of memory waste and gives increased tunability over "always". Another added benefit is that khugepaged will most likely not operate on short-lived allocations, meaning that only longstanding memory will be collapsed to THPs.
>>
>> The memory waste can be tuned with max_ptes_none... let's say you want ~90% of your PMD to be full before collapsing into a huge page: simply set max_ptes_none=64. Or, for no waste, set max_ptes_none=0, requiring all 512 pages to be present before being collapsed.
>>
>>>> This allows for two things: one, applications specifically designed to use hugepages will get them, and two, applications that don't use hugepages can still benefit from them without aggressively inserting THPs at every possible chance. This curbs the memory waste and defers the use of hugepages to khugepaged, which can then scan the memory for eligible collapse candidates.
>>>
>>> khugepaged would replace application memory with huge pages without a specific goal. Why not use a user space agent with process_madvise() to collapse huge pages? Admins might have more knobs to tweak than khugepaged.
>>
>> The benefits of "always" are that no userspace agent is needed, and applications don't have to be modified to use madvise(MADV_HUGEPAGE) to benefit from THPs. This setting hopes to gain some of the same benefits without the significant waste of memory, and with increased tunability.
>>
>> Future changes I have in the works are to make khugepaged more "smart", moving it away from the round-robin fashion it currently operates in, to instead make smart and informed decisions about what memory to collapse (and potentially split).
>>
>> Hopefully that helped explain the motivation for this new setting!
>
> Any last comments before I resend this?
>
> I've been made aware of https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u which introduces THP splitting. These are both trying to achieve the same thing through different means. Our approach leverages khugepaged to promote pages, while Usama's uses the reclaim path to demote hugepages and shrink the underlying memory.
>
> I will leave it up to reviewers to determine which is better; however, we can't have both, as we'd be introducing thrashing conditions.
>

Hi,

Just inserting this here from my cover letter:

Waiting for khugepaged to scan memory and collapse pages into THP can
be slow and unpredictable in terms of performance (i.e. you don't know
when the collapse will happen), while production environments require
predictable performance. If there is enough memory available, it's
better for both performance and predictability to have a THP from
fault time, i.e. THP=always, rather than wait for khugepaged to
collapse it, and deal with sparsely populated THPs when the system is
running out of memory.

I just went through your patches, and am not sure why we can't have
both? Both use max_ptes_none as the tunable. If the number of
zero-filled pages is above max_ptes_none, the shrinker will split
them, and khugepaged will not collapse them (SCAN_EXCEED_NONE_PTE), so
I don't see how it causes thrashing?

> Cheers,
> -- Nico
>
>> Cheers!
>> -- Nico
>>>
>>>> Admins may want to lower max_ptes_none; if not, khugepaged may aggressively collapse sparsely populated regions into hugepages.
>>>>
>>>> RFC note
>>>> ==========
>>>> I'm not sure if I'm missing anything related to the mTHP changes. I think that now that we have hugepage_pmd_enabled in commit 00f58104202c ("mm: fix khugepaged activation policy"), everything should work as expected.
>>>>
>>>> Nico Pache (2):
>>>>   mm: defer THP insertion to khugepaged
>>>>   mm: document transparent_hugepage=defer usage
>>>>
>>>>  Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
>>>>  include/linux/huge_mm.h                    | 15 +++++++++--
>>>>  mm/huge_memory.c                           | 31 +++++++++++++++++++---
>>>>  3 files changed, 55 insertions(+), 9 deletions(-)
>>>>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: Lance Yang <ioworker0@gmail.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Rafael Aquini <aquini@redhat.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>>> --
>>>> 2.45.2
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi

^ permalink raw reply	[flat|nested] 19+ messages in thread
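To illustrate the "underused" test being discussed (a hedged userspace
analog, not the kernel implementation, which operates on struct pages
during reclaim): count the zero-filled base pages in a PMD-sized region
and compare against max_ptes_none. The max_ptes_none=64 value is the
tuning assumed earlier in the thread.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096UL
#define HPAGE_PMD_NR 512	/* 2MiB / 4KiB */

/* Count zero-filled base pages in one PMD-sized region. */
static int zero_filled_pages(const unsigned char *region)
{
	static const unsigned char zero[PAGE_SIZE];
	int i, none = 0;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		if (!memcmp(region + i * PAGE_SIZE, zero, PAGE_SIZE))
			none++;
	return none;
}

int main(void)
{
	/* Demo buffer: all-zero, so it counts as fully underused. */
	static unsigned char region[HPAGE_PMD_NR * PAGE_SIZE];
	int max_ptes_none = 64;	/* assumed tuning from the thread */
	int none = zero_filled_pages(region);

	/* Same threshold both ways: split if above, keep/collapse if not. */
	printf("%d zero-filled pages: %s\n", none,
	       none > max_ptes_none ? "split candidate" : "keep THP");
	return 0;
}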
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-26 16:47 ` Usama Arif
@ 2024-08-26 21:14 ` Nico Pache
2024-08-27 10:37   ` Usama Arif
0 siblings, 1 reply; 19+ messages in thread
From: Nico Pache @ 2024-08-26 21:14 UTC (permalink / raw)
To: Usama Arif
Cc: linux-mm, linux-kernel, linux-doc, Andrew Morton,
David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts,
Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli,
Jonathan Corbet, Kirill A . Shutemov, Zi Yan, Johannes Weiner

On Mon, Aug 26, 2024 at 10:47 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> On 26/08/2024 11:40, Nico Pache wrote:
> > On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@redhat.com> wrote:
> >>
> >> Hi Zi Yan,
> >>
> >> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote:
> >>>
> >>> +Kirill
> >>>
> >>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
> >>>
> >>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a significant increase in the memory footprint for the same workloads.
> >>>>
> >>>> Through our investigations we found that a large contributing factor to the increase in RSS was an increase in THP usage.
> >>>
> >>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
> >>
> >> IIRC, most of the systems tuning is the same. We attributed the increase in THP usage to a combination of improvements in the kernel and improvements in the libraries (better alignments). That allowed THP allocations to succeed at a higher rate. I can go back and confirm this tomorrow though.
> >>
> >>>> For workloads like MySQL, or when using allocators like jemalloc, it is often recommended to set transparent_hugepage/enabled=never. This is in part due to performance degradation and increased memory waste.
> >>>>
> >>>> This series introduces enabled=defer, a setting that acts as a middle ground between always and madvise. If the mapping is MADV_HUGEPAGE, the page fault handler will act normally, making a hugepage if possible. If the allocation is not MADV_HUGEPAGE, then the page fault handler will default to the base size allocation. The caveat is that khugepaged can still operate on pages that are not MADV_HUGEPAGE.
> >>>
> >>> Why? If the user does not explicitly want huge pages, why bother providing them? Wouldn't it increase the memory footprint?
> >>
> >> So we have "always", which will always try to allocate a THP when it can. This setting gives good performance in a lot of conditions, but tends to waste memory. Additionally, applications DON'T need to be modified to take advantage of THPs.
> >>
> >> We have "madvise", which will only satisfy allocations that are MADV_HUGEPAGE; this gives you granular control, and a lot of times these madvises come from libraries. Unlike "always", you DO need to modify your application if you want to use THPs.
> >>
> >> Then we have "never", which of course never allocates THPs.
> >>
> >> OK, back to your question: like "madvise", "defer" gives you the benefits of THPs when you specifically know you want them (MADV_HUGEPAGE), but it also benefits applications that don't specifically ask for them (or can't be modified to ask for them), like "always" does. The applications that don't ask for THPs must wait for khugepaged to get them (avoiding insertions at PF time) -- this curbs a lot of memory waste and gives increased tunability over "always". Another added benefit is that khugepaged will most likely not operate on short-lived allocations, meaning that only longstanding memory will be collapsed to THPs.
> >>
> >> The memory waste can be tuned with max_ptes_none... let's say you want ~90% of your PMD to be full before collapsing into a huge page: simply set max_ptes_none=64. Or, for no waste, set max_ptes_none=0, requiring all 512 pages to be present before being collapsed.
> >>
> >>>> This allows for two things: one, applications specifically designed to use hugepages will get them, and two, applications that don't use hugepages can still benefit from them without aggressively inserting THPs at every possible chance. This curbs the memory waste and defers the use of hugepages to khugepaged, which can then scan the memory for eligible collapse candidates.
> >>>
> >>> khugepaged would replace application memory with huge pages without a specific goal. Why not use a user space agent with process_madvise() to collapse huge pages? Admins might have more knobs to tweak than khugepaged.
> >>
> >> The benefits of "always" are that no userspace agent is needed, and applications don't have to be modified to use madvise(MADV_HUGEPAGE) to benefit from THPs. This setting hopes to gain some of the same benefits without the significant waste of memory, and with increased tunability.
> >>
> >> Future changes I have in the works are to make khugepaged more "smart", moving it away from the round-robin fashion it currently operates in, to instead make smart and informed decisions about what memory to collapse (and potentially split).
> >>
> >> Hopefully that helped explain the motivation for this new setting!
> >
> > Any last comments before I resend this?
> >
> > I've been made aware of https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u which introduces THP splitting. These are both trying to achieve the same thing through different means. Our approach leverages khugepaged to promote pages, while Usama's uses the reclaim path to demote hugepages and shrink the underlying memory.
> >
> > I will leave it up to reviewers to determine which is better; however, we can't have both, as we'd be introducing thrashing conditions.
> >
>
> Hi,
>
> Just inserting this here from my cover letter:
>
> Waiting for khugepaged to scan memory and collapse pages into THP can be slow and unpredictable in terms of performance

Obviously not part of my patchset here, but I have been testing some
changes to khugepaged to make it more aware of what processes are hot.
Ideally then it can make better choices of what to operate on.

> (i.e. you don't know when the collapse will happen), while production environments require predictable performance. If there is enough memory available, it's better for both performance and predictability to have a THP from fault time, i.e. THP=always, rather than wait for khugepaged to collapse it, and deal with sparsely populated THPs when the system is running out of memory.
>
> I just went through your patches, and am not sure why we can't have both?

Fair point, we can. I've been playing around with splitting hugepages
via khugepaged and was thinking of the thrashing conditions there --
but your implementation takes a different approach.

I've been working on performance testing my "defer" changes; once I
find the appropriate workloads I'll try adding your changes to the
mix. I have a feeling my approach is better for latency-sensitive
workloads, while yours is better for throughput, but let me find a way
to confirm that.

> Both use max_ptes_none as the tunable. If the number of zero-filled pages is above max_ptes_none, the shrinker will split them, and khugepaged will not collapse them (SCAN_EXCEED_NONE_PTE), so I don't see how it causes thrashing?
>
> > Cheers,
> > -- Nico
> >
> >> Cheers!
> >> -- Nico
> >>>
> >>>> Admins may want to lower max_ptes_none; if not, khugepaged may aggressively collapse sparsely populated regions into hugepages.
> >>>>
> >>>> RFC note
> >>>> ==========
> >>>> I'm not sure if I'm missing anything related to the mTHP changes. I think that now that we have hugepage_pmd_enabled in commit 00f58104202c ("mm: fix khugepaged activation policy"), everything should work as expected.
> >>>>
> >>>> Nico Pache (2):
> >>>>   mm: defer THP insertion to khugepaged
> >>>>   mm: document transparent_hugepage=defer usage
> >>>>
> >>>>  Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
> >>>>  include/linux/huge_mm.h                    | 15 +++++++++--
> >>>>  mm/huge_memory.c                           | 31 +++++++++++++++++++---
> >>>>  3 files changed, 55 insertions(+), 9 deletions(-)
> >>>>
> >>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>> Cc: David Hildenbrand <david@redhat.com>
> >>>> Cc: Matthew Wilcox <willy@infradead.org>
> >>>> Cc: Barry Song <baohua@kernel.org>
> >>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>> Cc: Lance Yang <ioworker0@gmail.com>
> >>>> Cc: Peter Xu <peterx@redhat.com>
> >>>> Cc: Zi Yan <ziy@nvidia.com>
> >>>> Cc: Rafael Aquini <aquini@redhat.com>
> >>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >>>> Cc: Jonathan Corbet <corbet@lwn.net>
> >>>> --
> >>>> 2.45.2
> >>>
> >>> --
> >>> Best Regards,
> >>> Yan, Zi

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC 0/2] mm: introduce THP deferred setting 2024-08-26 21:14 ` Nico Pache @ 2024-08-27 10:37 ` Usama Arif 2024-08-27 11:09 ` Johannes Weiner 0 siblings, 1 reply; 19+ messages in thread From: Usama Arif @ 2024-08-27 10:37 UTC (permalink / raw) To: Nico Pache Cc: linux-mm, linux-kernel, linux-doc, Andrew Morton, David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Kirill A . Shutemov, Zi Yan, Johannes Weiner On 26/08/2024 17:14, Nico Pache wrote: > On Mon, Aug 26, 2024 at 10:47 AM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 26/08/2024 11:40, Nico Pache wrote: >>> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@redhat.com> wrote: >>>> >>>> Hi Zi Yan, >>>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote: >>>>> >>>>> +Kirill >>>>> >>>>> On 29 Jul 2024, at 18:27, Nico Pache wrote: >>>>> >>>>>> We've seen cases were customers switching from RHEL7 to RHEL8 see a >>>>>> significant increase in the memory footprint for the same workloads. >>>>>> >>>>>> Through our investigations we found that a large contributing factor to >>>>>> the increase in RSS was an increase in THP usage. >>>>> >>>>> Any knob is changed from RHEL7 to RHEL8 to cause more THP usage? >>>> IIRC, most of the systems tuning is the same. We attributed the >>>> increase in THP usage to a combination of improvements in the kernel, >>>> and improvements in the libraries (better alignments). That allowed >>>> THP allocations to succeed at a higher rate. I can go back and confirm >>>> this tomorrow though. >>>>> >>>>>> >>>>>> For workloads like MySQL, or when using allocators like jemalloc, it is >>>>>> often recommended to set /transparent_hugepages/enabled=never. This is >>>>>> in part due to performance degradations and increased memory waste. >>>>>> >>>>>> This series introduces enabled=defer, this setting acts as a middle >>>>>> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the >>>>>> page fault handler will act normally, making a hugepage if possible. If >>>>>> the allocation is not MADV_HUGEPAGE, then the page fault handler will >>>>>> default to the base size allocation. The caveat is that khugepaged can >>>>>> still operate on pages thats not MADV_HUGEPAGE. >>>>> >>>>> Why? If user does not explicitly want huge page, why bother providing huge >>>>> pages? Wouldn't it increase memory footprint? >>>> >>>> So we have "always", which will always try to allocate a THP when it >>>> can. This setting gives good performance in a lot of conditions, but >>>> tends to waste memory. Additionally applications DON'T need to be >>>> modified to take advantage of THPs. >>>> >>>> We have "madvise" which will only satisfy allocations that are >>>> MADV_HUGEPAGE, this gives you granular control, and a lot of times >>>> these madvises come from libraries. Unlike "always" you DO need to >>>> modify your application if you want to use THPs. >>>> >>>> Then we have "never", which of course, never allocates THPs. >>>> >>>> Ok. back to your question, like "madvise", "defer" gives you the >>>> benefits of THPs when you specifically know you want them >>>> (madv_hugepage), but also benefits applications that dont specifically >>>> ask for them (or cant be modified to ask for them), like "always" >>>> does. The applications that dont ask for THPs must wait for khugepaged >>>> to get them (avoid insertions at PF time)-- this curbs a lot of memory >>>> waste, and gives an increased tunability over "always". 
Another added >>>> benefit is that khugepaged will most likely not operate on short lived >>>> allocations, meaning that only longstanding memory will be collapsed >>>> to THPs. >>>> >>>> The memory waste can be tuned with max_ptes_none... lets say you want >>>> ~90% of your PMD to be full before collapsing into a huge page. simply >>>> set max_ptes_none=64. or no waste, set max_ptes_none=0, requiring the >>>> 512 pages to be present before being collapsed. >>>> >>>>> >>>>>> >>>>>> This allows for two things... one, applications specifically designed to >>>>>> use hugepages will get them, and two, applications that don't use >>>>>> hugepages can still benefit from them without aggressively inserting >>>>>> THPs at every possible chance. This curbs the memory waste, and defers >>>>>> the use of hugepages to khugepaged. Khugepaged can then scan the memory >>>>>> for eligible collapsing. >>>>> >>>>> khugepaged would replace application memory with huge pages without specific >>>>> goal. Why not use a user space agent with process_madvise() to collapse >>>>> huge pages? Admin might have more knobs to tweak than khugepaged. >>>> >>>> The benefits of "always" are that no userspace agent is needed, and >>>> applications dont have to be modified to use madvise(MADV_HUGEPAGE) to >>>> benefit from THPs. This setting hopes to gain some of the same >>>> benefits without the significant waste of memory and an increased >>>> tunability. >>>> >>>> future changes I have in the works are to make khugepaged more >>>> "smart". Moving it away from the round robin fashion it currently >>>> operates in, to instead make smart and informed decisions of what >>>> memory to collapse (and potentially split). >>>> >>>> Hopefully that helped explain the motivation for this new setting! >>> >>> Any last comments before I resend this? >>> >>> Ive been made aware of >>> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u >>> which introduces THP splitting. These are both trying to achieve the >>> same thing through different means. Our approach leverages khugepaged >>> to promote pages, while Usama's uses the reclaim path to demote >>> hugepages and shrink the underlying memory. >>> >>> I will leave it up to reviewers to determine which is better; However, >>> we can't have both, as we'd be introducing trashing conditions. >>> >> >> Hi, >> >> Just inserting this here from my cover letter: >> >> Waiting for khugepaged to scan memory and >> collapse pages into THP can be slow and unpredictable in terms of performance > Obviously not part of my patchset here, but I have been testing some > changes to khugepaged to make it more aware of what processes are hot. > Ideally then it can make better choices of what to operate on. >> (i.e. you dont know when the collapse will happen), while production >> environments require predictable performance. If there is enough memory >> available, its better for both performance and predictability to have >> a THP from fault time, i.e. THP=always rather than wait for khugepaged >> to collapse it, and deal with sparsely populated THPs when the system is >> running out of memory. >> >> I just went through your patches, and am not sure why we can't have both? > Fair point, we can. I've been playing around with splitting hugepages > and via khugepaged and was thinking of the trashing conditions there-- > but your implementation takes a different approach. 
> I've been working on performance testing my "defer" changes, once I > find the appropriate workloads I'll try adding your changes to the > mix. I have a feeling my approach is better for latency sensitive > workloads, while yours is better for throughput, but let me find a way > to confirm that. > > Hmm, I am not sure if its latency vs throughput. There are 2 things we probably want to consider, short lived and long lived mappings, and in each of these situations, having enough memory and running out of memory. For short lived mappings, I believe reducing page faults is a bigger factor in improving performance. In that case, khugepaged won't have enough time to work, so THP=always will perform better than THP=defer. THP=defer in this case will perform the same as THP=madvise? If there is enough memory, then the changes I introduced in the shrinker won't cost anything as the shrinker won't run, and the system performance will be the same as THP=always. If there is low memory and the shrinker runs, it will only split THPs that have zero-filled pages more than max_ptes_none, and map the zero-filled pages to shared zero-pages saving memory. There is ofcourse a cost to splitting and running the shrinker, but hopefully it only splits underused THPs. For long lived mappings, reduced TLB misses would be the bigger factor in improving performance. For the initial run of the application THP=always will perform better wrt TLB misses as page fault handler will give THPs from start. Later on in the run, the memory might look similar between THP=always with shrinker and max_ptes_none < HPAGE_PMD_NR vs THP=defer and max_ptes_none < HPAGE_PMD_NR? This is because khugepaged will have collapsed pages that might have initially been faulted in. And collapsing has a cost, which would not have been incurred if the THPs were present from fault. If there is low memory, then shrinker would split memory (which has a cost as well) and the system memory would look similar or better than THP=defer, as the shrinker would split THPs that initially might not have been underused, but are underused at time of memory pressure. With THP=always + underused shrinker, the cost (splitting) is incurred only if needed and when its needed. While with THP=defer the cost (higher page faults, higher TLB misses + khugepaged collapse) is incurred all the time, even if the system might have plenty of memory available and there is no need to take a performance hit. > > > >> Both use max_ptes_none as the tunable. If the number of zero-filled pages >> is above max_ptes_none, the shrinker will split them, and khugepaged will not collapse >> them (SCAN_EXCEED_NONE_PTE), so I don't see how it causes trashing? >> >> >>> Cheers, >>> -- Nico >>> >>> >>> >>> >>> >>>> >>>> Cheer! >>>> -- Nico >>>>> >>>>>> >>>>>> Admins may want to lower max_ptes_none, if not, khugepaged may >>>>>> aggressively collapse single allocations into hugepages. >>>>>> >>>>>> RFC note >>>>>> ========== >>>>>> Im not sure if im missing anything related to the mTHP >>>>>> changes. I think now that we have hugepage_pmd_enabled in >>>>>> commit 00f58104202c ("mm: fix khugepaged activation policy") everything >>>>>> should work as expected. 
^ permalink raw reply [flat|nested] 19+ messages in thread
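For reference, Nico's max_ptes_none arithmetic above maps to sysfs like this; a sketch assuming 4K base pages (512 PTEs per PMD), where "defer" is the value proposed by this series, not an existing upstream setting:

  # Proposed by this series: only khugepaged installs THPs for
  # mappings that are not MADV_HUGEPAGE
  echo defer > /sys/kernel/mm/transparent_hugepage/enabled

  # Collapse only once ~90% of a PMD range is populated:
  # at most 64 of the 512 PTEs may be empty (448/512 present)
  echo 64 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

  # No waste at all: all 512 pages must be present before collapsing
  echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none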
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-27 10:37 ` Usama Arif
@ 2024-08-27 11:09 ` Johannes Weiner
2024-08-27 11:46 ` David Hildenbrand
2024-08-28 1:18 ` Rik van Riel
0 siblings, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2024-08-27 11:09 UTC (permalink / raw)
To: Usama Arif
Cc: Nico Pache, linux-mm, linux-kernel, linux-doc, Andrew Morton, David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Kirill A . Shutemov, Zi Yan

On Tue, Aug 27, 2024 at 11:37:14AM +0100, Usama Arif wrote:
> [...]
> With THP=always + underused shrinker, the cost (splitting) is incurred only if needed and when it's needed.
> While with THP=defer the cost (higher page faults, higher TLB misses + khugepaged collapse) is incurred all the time, even if the system might have plenty of memory available and there is no need to take a performance hit.

I agree with this. The defer mode is an improvement over the upstream status quo, no doubt. However, both defer mode and the shrinker solve the issue of memory waste under pressure, while the shrinker permits more desirable behavior when memory is abundant.

So my take is that the shrinker is the way to go, and I don't see a bona fide use case for defer mode that the shrinker couldn't cover.
^ permalink raw reply [flat|nested] 19+ messages in thread
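To ground the shrinker side of this comparison, the underused check being discussed works roughly like the following sketch. This is an illustrative reconstruction of the logic described in Usama's series, not the code from it; thp_is_underused() is a hypothetical name:

  /* Sketch: count zero-filled subpages of a PMD-sized folio and compare
   * against max_ptes_none; if exceeded, the folio is "underused" and a
   * candidate for splitting under memory pressure. */
  static bool thp_is_underused(struct folio *folio, int max_ptes_none)
  {
          int i, nr_zero = 0;

          for (i = 0; i < folio_nr_pages(folio); i++) {
                  void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);

                  /* memchr_inv() returns NULL if the page is all zeroes */
                  if (!memchr_inv(kaddr, 0, PAGE_SIZE))
                          nr_zero++;
                  kunmap_local(kaddr);

                  if (nr_zero > max_ptes_none)
                          return true;
          }
          return false;
  }

Zero-filled subpages found this way can then be unmapped and remapped to the shared zero page when the folio is split, which is where the memory saving comes from.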
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-27 11:09 ` Johannes Weiner
@ 2024-08-27 11:46 ` David Hildenbrand
2024-08-27 13:05 ` Johannes Weiner
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-27 11:46 UTC (permalink / raw)
To: Johannes Weiner, Usama Arif
Cc: Nico Pache, linux-mm, linux-kernel, linux-doc, Andrew Morton, Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Kirill A . Shutemov, Zi Yan

On 27.08.24 13:09, Johannes Weiner wrote:
> [...]
> I agree with this. The defer mode is an improvement over the upstream status quo, no doubt. However, both defer mode and the shrinker solve the issue of memory waste under pressure, while the shrinker permits more desirable behavior when memory is abundant.
>
> So my take is that the shrinker is the way to go, and I don't see a bona fide use case for defer mode that the shrinker couldn't cover.
Page fault latency? IOW, zeroing a complete THP, which might be up to 512 MiB on arm64. This is one of the things people bring up, where FreeBSD is different because it will zero fragments on demand (but this also results in more page faults).

On the downside, in the past we could easily and repeatedly fail to collapse THPs in busy environments. With per-VMA locks this might have improved in the meantime.

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 19+ messages in thread
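For context on the 512 MiB figure: on 64-bit, a PMD-level THP spans PAGE_SIZE * (PAGE_SIZE / 8) bytes (8 bytes per page-table entry), and the kernel exports the resulting size directly:

  # PMD-level THP size in bytes on the running kernel
  cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
  # 4K base pages:   4096 * (4096 / 8)  =   2 MiB
  # 64K base pages: 65536 * (65536 / 8) = 512 MiB (e.g. arm64 with 64K pages)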
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-27 11:46 ` David Hildenbrand
@ 2024-08-27 13:05 ` Johannes Weiner
2024-08-27 13:22 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2024-08-27 13:05 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Nico Pache, linux-mm, linux-kernel, linux-doc, Andrew Morton, Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Kirill A . Shutemov, Zi Yan

On Tue, Aug 27, 2024 at 01:46:26PM +0200, David Hildenbrand wrote:
> [...]
>> So my take is that the shrinker is the way to go, and I don't see a bona fide use case for defer mode that the shrinker couldn't cover.
>
> Page fault latency? IOW, zeroing a complete THP, which might be up to 512 MiB on arm64. This is one of the things people bring up, where FreeBSD is different because it will zero fragments on demand (but also result in more page faults).

Good point, I suppose this could matter to some workloads. It would be interesting to get real-world data on that if that's the argument.

Btw, I'm curious if anybody actually uses or has tried to use 512MB THPs. I understand it's technically possible, and there are use cases for 1G pages through static hugetlb management, but 512M seems wild for a "real" page size - real as in: allocated on-demand and actually paged to/from storage.

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-27 13:05 ` Johannes Weiner
@ 2024-08-27 13:22 ` David Hildenbrand
2024-08-27 13:57 ` Usama Arif
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-08-27 13:22 UTC (permalink / raw)
To: Johannes Weiner
Cc: Usama Arif, Nico Pache, linux-mm, linux-kernel, linux-doc, Andrew Morton, Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Kirill A . Shutemov, Zi Yan

On 27.08.24 15:05, Johannes Weiner wrote:
> [...]
> Good point, I suppose this could matter to some workloads. Would be interesting to get real-world data on that if that's the argument.

IIRC, there is mention of some of these workloads in THP optimization papers, where even the cost of defragmentation+reclaim+zeroing of 2MiB apparently affects some latency-sensitive workloads.

But the challenge that THP-deferred and the shrinker are trying to solve right now is more towards memory over-allocation, and for that the shrinker should *mostly* be able to do what the deferred option would achieve.

There are these corner cases where a workload accesses some pages and then long-term pins them (e.g., io_uring fixed buffers), where the over-allocation cannot be reclaimed until the long-term pin is gone.

But it's not been raised as a real problem in practice, so I'm just mentioning it as a key difference: the shrinker runs after memory over-allocation, with the cost of memory scanning and temporary memory over-allocation+zeroing. "deferred" avoids memory over-allocation, with the cost of more page faults + relying on khugepaged making progress.

> Btw, I'm curious if anybody actually uses or has tried to use 512MB THPs. I understand it's technically possible, and there are use cases for 1G pages through static hugetlb management, but 512M seems wild for a "real" page size - real as in: allocated on-demand and actually paged to/from storage.

I think it's less a matter of "using" or "trying", but more THPs being enabled by default in many setups nowadays, and if you have sufficient memory in the system, even these weird 512MiB THPs will end up in your address space if a (sparse) VMA is large enough :)

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 19+ messages in thread
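The long-term pin corner case David mentions can be produced with liburing's fixed buffers; a minimal sketch under assumed conditions (the buffer happens to be THP-backed; error handling elided):

  /* Registering a fixed buffer long-term pins the pages backing 'buf',
   * so an over-allocated THP under it cannot be split or reclaimed
   * until the buffers are unregistered again. */
  #include <liburing.h>
  #include <sys/uio.h>
  #include <stdlib.h>

  int main(void)
  {
          struct io_uring ring;
          size_t len = 2 * 1024 * 1024;           /* one PMD on x86-64 */
          void *buf = aligned_alloc(len, len);    /* may be THP-backed */
          struct iovec iov = { .iov_base = buf, .iov_len = len };

          io_uring_queue_init(8, &ring, 0);
          io_uring_register_buffers(&ring, &iov, 1); /* pages pinned here */

          /* ... submit reads/writes that use the fixed buffer ... */

          io_uring_unregister_buffers(&ring);     /* pin dropped */
          io_uring_queue_exit(&ring);
          free(buf);
          return 0;
  }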
* Re: [RFC 0/2] mm: introduce THP deferred setting
2024-08-27 13:22 ` David Hildenbrand
@ 2024-08-27 13:57 ` Usama Arif
2024-08-27 22:04 ` Nico Pache
0 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2024-08-27 13:57 UTC (permalink / raw)
To: David Hildenbrand, Johannes Weiner
Cc: Nico Pache, linux-mm, linux-kernel, linux-doc, Andrew Morton, Matthew Wilcox, Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Kirill A . Shutemov, Zi Yan

On 27/08/2024 09:22, David Hildenbrand wrote:
> [...]
>>>>>
>>>>> With THP=always + underused shrinker, the cost (splitting) is incurred
>>>>> only if and when it's needed.
>>>>> While with THP=defer, the cost (higher page faults, higher TLB misses
>>>>> + khugepaged collapse) is incurred all the time, even if the system
>>>>> might have plenty of memory available and there is no need to take a
>>>>> performance hit.
>>>>
>>>> I agree with this. The defer mode is an improvement over the upstream
>>>> status quo, no doubt. However, both defer mode and the shrinker solve
>>>> the issue of memory waste under pressure, while the shrinker permits
>>>> more desirable behavior when memory is abundant.
>>>>
>>>> So my take is that the shrinker is the way to go, and I don't see a
>>>> bona fide use case for defer mode that the shrinker couldn't cover.
>>>
>>> Page fault latency? IOW, zeroing a complete THP, which might be up to
>>> 512 MiB on arm64. This is one of the things people bring up, where
>>> FreeBSD is different because it will zero fragments on-demand (but also
>>> result in more page faults).
>>
>> Good point, I suppose this could matter to some workloads. Would be
>> interesting to get real-world data on that if that's the argument.
>
> IIRC, there is mention of some of these workloads in THP optimization
> papers, where even the cost of defragmentation+reclaim+zeroing of 2MiB
> apparently affects some latency-sensitive workloads.
>
> But the challenge that THP-deferred and the shrinker are trying to solve
> right now is more towards memory over-allocation, and for that the
> shrinker should *mostly* be able to do what the deferred option would
> achieve.
>

I would say the underused shrinker is trying to solve memory
over-allocation due to THP=always *while* trying to preserve the
performance benefits of THP=always, which THP=defer might not achieve
with the same performance due to more running costs?

> There are these corner cases where a workload accesses some pages and
> then long-term pins them (e.g., io_uring fixed buffers), where the
> over-allocation cannot be reclaimed until the long-term pin is gone.
>
> But it's not been raised as a real problem in practice, so I'm just
> mentioning it as a key difference: the shrinker runs after memory
> over-allocation, with the cost of memory scanning and temporary memory
> over-allocation+zeroing. "deferred" avoids memory over-allocation, with
> the cost of more page faults + relying on khugepaged making progress.
>

And the deferred option also has the cost of higher TLB misses (which can
be quite significant) when the pages haven't been collapsed yet, and all
of these are costs that shouldn't even be incurred if the system has
plenty of memory.

If there aren't many underused pages, then for long-term mappings, the
memory would look similar in both cases after some time? As khugepaged
would just collapse pages that would have been there from page fault with
THP=always (especially if khugepaged is made more effective). In that
case, the shrinker solution would have had no cost for that time (or
maybe just page fault latency?), while THP=defer would incur all three
costs?

Both solutions do have their pros and cons, and I am not against having
THP=defer. But I do feel the shrinker is a better solution, with the
experiments in Meta production [1], and only having to incur a cost if
and when needed. I could be biased as well :)

[1] https://lore.kernel.org/all/20240819023145.2415299-1-usamaarif642@gmail.com/#t

>>
>> Btw, I'm curious if anybody actually uses or has tried to use 512MB
>> THPs.
>> I understand it's technically possible, and there are use cases for
>> 1G pages through static hugetlb management, but 512M seems wild for a
>> "real" page size - real as in: allocated on-demand and actually paged
>> to/from storage.
>
> I think it's less a matter of "using" or "trying", but more THPs being
> default enabled in many setups nowadays, and if you have sufficient
> memory in the system, even these weird 512MiB THPs will end up in your
> address space if a (sparse) VMA is large enough :)
>

The page fault latency is a good point for ARM servers. I would imagine
more ARM servers run with a 64K page size compared to 4K, and if the
default config option for THP is not changed, then we could run into
512M THPs!

^ permalink raw reply	[flat|nested] 19+ messages in thread
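The 512 MiB figure falls directly out of the page-table geometry: a
PMD-sized THP spans one full last-level page table, i.e. (page size / 8)
base pages with 8-byte page-table entries. A minimal standalone sketch of
the arithmetic, illustrative only, assuming 8-byte PTEs as on arm64 and
x86-64:

#include <stdio.h>

int main(void)
{
	/* Base page sizes supported by arm64: 4K, 16K, 64K. */
	unsigned long page_size[] = { 4096, 16384, 65536 };

	for (int i = 0; i < 3; i++) {
		unsigned long entries = page_size[i] / 8;       /* PTEs per table */
		unsigned long pmd_thp = entries * page_size[i]; /* bytes per THP */

		printf("%2luK base pages -> %3lu MiB PMD-sized THP\n",
		       page_size[i] >> 10, pmd_thp >> 20);
	}
	return 0;
}

With 4K pages this gives the familiar 2 MiB; with 64K pages it balloons to
512 MiB, which is why the fault-time zeroing latency is so much more
pronounced on 64K-page arm64 configurations.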
* Re: [RFC 0/2] mm: introduce THP deferred setting
  2024-08-27 13:57 ` Usama Arif
@ 2024-08-27 22:04 ` Nico Pache
  0 siblings, 0 replies; 19+ messages in thread
From: Nico Pache @ 2024-08-27 22:04 UTC (permalink / raw)
  To: Usama Arif, Johannes Weiner, David Hildenbrand
  Cc: linux-mm, linux-kernel, linux-doc, Andrew Morton, Matthew Wilcox,
    Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu,
    Rafael Aquini, Andrea Arcangeli, Jonathan Corbet,
    Kirill A . Shutemov, Zi Yan

On Tue, Aug 27, 2024 at 7:57 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 27/08/2024 09:22, David Hildenbrand wrote:
> > On 27.08.24 15:05, Johannes Weiner wrote:
> >> On Tue, Aug 27, 2024 at 01:46:26PM +0200, David Hildenbrand wrote:
> >>> On 27.08.24 13:09, Johannes Weiner wrote:
> >>>> On Tue, Aug 27, 2024 at 11:37:14AM +0100, Usama Arif wrote:
> >>>>>
> >>>>>
> >>>>> On 26/08/2024 17:14, Nico Pache wrote:
> >>>>>> On Mon, Aug 26, 2024 at 10:47 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 26/08/2024 11:40, Nico Pache wrote:
> >>>>>>>> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Zi Yan,
> >>>>>>>>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@nvidia.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> +Kirill
> >>>>>>>>>>
> >>>>>>>>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
> >>>>>>>>>>
> >>>>>>>>>>> We've seen cases where customers switching from RHEL7 to RHEL8
> >>>>>>>>>>> see a significant increase in the memory footprint for the same
> >>>>>>>>>>> workloads.
> >>>>>>>>>>>
> >>>>>>>>>>> Through our investigations we found that a large contributing
> >>>>>>>>>>> factor to the increase in RSS was an increase in THP usage.
> >>>>>>>>>>
> >>>>>>>>>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
> >>>>>>>>> IIRC, most of the system tuning is the same. We attributed the
> >>>>>>>>> increase in THP usage to a combination of improvements in the
> >>>>>>>>> kernel, and improvements in the libraries (better alignments).
> >>>>>>>>> That allowed THP allocations to succeed at a higher rate. I can
> >>>>>>>>> go back and confirm this tomorrow though.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> For workloads like MySQL, or when using allocators like
> >>>>>>>>>>> jemalloc, it is often recommended to set
> >>>>>>>>>>> /transparent_hugepages/enabled=never. This is in part due to
> >>>>>>>>>>> performance degradations and increased memory waste.
> >>>>>>>>>>>
> >>>>>>>>>>> This series introduces enabled=defer; this setting acts as a
> >>>>>>>>>>> middle ground between always and madvise. If the mapping is
> >>>>>>>>>>> MADV_HUGEPAGE, the page fault handler will act normally, making
> >>>>>>>>>>> a hugepage if possible. If the allocation is not MADV_HUGEPAGE,
> >>>>>>>>>>> then the page fault handler will default to the base size
> >>>>>>>>>>> allocation. The caveat is that khugepaged can still operate on
> >>>>>>>>>>> pages that are not MADV_HUGEPAGE.
> >>>>>>>>>>
> >>>>>>>>>> Why? If the user does not explicitly want huge pages, why bother
> >>>>>>>>>> providing them? Wouldn't it increase the memory footprint?
> >>>>>>>>>
> >>>>>>>>> So we have "always", which will always try to allocate a THP when
> >>>>>>>>> it can. This setting gives good performance in a lot of
> >>>>>>>>> conditions, but tends to waste memory. Additionally, applications
> >>>>>>>>> DON'T need to be modified to take advantage of THPs.
> >>>>>>>>>
> >>>>>>>>> We have "madvise", which will only satisfy allocations that are
> >>>>>>>>> MADV_HUGEPAGE; this gives you granular control, and a lot of the
> >>>>>>>>> time these madvises come from libraries. Unlike "always", you DO
> >>>>>>>>> need to modify your application if you want to use THPs.
> >>>>>>>>>
> >>>>>>>>> Then we have "never", which, of course, never allocates THPs.
> >>>>>>>>>
> >>>>>>>>> OK, back to your question: like "madvise", "defer" gives you the
> >>>>>>>>> benefits of THPs when you specifically know you want them
> >>>>>>>>> (MADV_HUGEPAGE), but also benefits applications that don't
> >>>>>>>>> specifically ask for them (or can't be modified to ask for them),
> >>>>>>>>> like "always" does. The applications that don't ask for THPs must
> >>>>>>>>> wait for khugepaged to get them (avoiding insertions at PF time)
> >>>>>>>>> -- this curbs a lot of memory waste and gives increased
> >>>>>>>>> tunability over "always". Another added benefit is that
> >>>>>>>>> khugepaged will most likely not operate on short-lived
> >>>>>>>>> allocations, meaning that only longstanding memory will be
> >>>>>>>>> collapsed to THPs.
> >>>>>>>>>
> >>>>>>>>> The memory waste can be tuned with max_ptes_none... let's say you
> >>>>>>>>> want ~90% of your PMD to be full before collapsing into a huge
> >>>>>>>>> page: simply set max_ptes_none=64. Or, for no waste, set
> >>>>>>>>> max_ptes_none=0, requiring all 512 pages to be present before the
> >>>>>>>>> range is collapsed.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> This allows for two things... one, applications specifically
> >>>>>>>>>>> designed to use hugepages will get them, and two, applications
> >>>>>>>>>>> that don't use hugepages can still benefit from them without
> >>>>>>>>>>> aggressively inserting THPs at every possible chance. This
> >>>>>>>>>>> curbs the memory waste and defers the use of hugepages to
> >>>>>>>>>>> khugepaged. Khugepaged can then scan the memory for eligible
> >>>>>>>>>>> collapsing.
> >>>>>>>>>>
> >>>>>>>>>> khugepaged would replace application memory with huge pages
> >>>>>>>>>> without a specific goal. Why not use a userspace agent with
> >>>>>>>>>> process_madvise() to collapse huge pages? An admin might have
> >>>>>>>>>> more knobs to tweak than khugepaged.
> >>>>>>>>>
> >>>>>>>>> The benefits of "always" are that no userspace agent is needed
> >>>>>>>>> and applications don't have to be modified to use
> >>>>>>>>> madvise(MADV_HUGEPAGE) to benefit from THPs. This setting hopes
> >>>>>>>>> to gain some of the same benefits without the significant waste
> >>>>>>>>> of memory, and with increased tunability.
> >>>>>>>>>
> >>>>>>>>> Future changes I have in the works are to make khugepaged more
> >>>>>>>>> "smart", moving it away from the round-robin fashion it currently
> >>>>>>>>> operates in to instead make smart and informed decisions about
> >>>>>>>>> which memory to collapse (and potentially split).
> >>>>>>>>>
> >>>>>>>>> Hopefully that helped explain the motivation for this new
> >>>>>>>>> setting!
> >>>>>>>>
> >>>>>>>> Any last comments before I resend this?
> >>>>>>>>
> >>>>>>>> I've been made aware of
> >>>>>>>> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
> >>>>>>>> which introduces THP splitting. These are both trying to achieve
> >>>>>>>> the same thing through different means. Our approach leverages
> >>>>>>>> khugepaged to promote pages, while Usama's uses the reclaim path
> >>>>>>>> to demote hugepages and shrink the underlying memory.
> >>>>>>>>
> >>>>>>>> I will leave it up to reviewers to determine which is better;
> >>>>>>>> however, we can't have both, as we'd be introducing thrashing
> >>>>>>>> conditions.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Just inserting this here from my cover letter:
> >>>>>>>
> >>>>>>> Waiting for khugepaged to scan memory and collapse pages into THP
> >>>>>>> can be slow and unpredictable in terms of performance
> >>>>>> Obviously not part of my patchset here, but I have been testing
> >>>>>> some changes to khugepaged to make it more aware of which processes
> >>>>>> are hot. Ideally then it can make better choices about what to
> >>>>>> operate on.
> >>>>>>> (i.e. you don't know when the collapse will happen), while
> >>>>>>> production environments require predictable performance. If there
> >>>>>>> is enough memory available, it's better for both performance and
> >>>>>>> predictability to have a THP from fault time, i.e. THP=always,
> >>>>>>> rather than wait for khugepaged to collapse it, and deal with
> >>>>>>> sparsely populated THPs when the system is running out of memory.
> >>>>>>>
> >>>>>>> I just went through your patches, and am not sure why we can't
> >>>>>>> have both?
> >>>>>> Fair point, we can. I've been playing around with splitting
> >>>>>> hugepages via khugepaged and was thinking of the thrashing
> >>>>>> conditions there -- but your implementation takes a different
> >>>>>> approach.
> >>>>>> I've been working on performance testing my "defer" changes; once I
> >>>>>> find the appropriate workloads I'll try adding your changes to the
> >>>>>> mix. I have a feeling my approach is better for latency-sensitive
> >>>>>> workloads, while yours is better for throughput, but let me find a
> >>>>>> way to confirm that.
> >>>>>>
> >>>>>>
> >>>>> Hmm, I am not sure if it's latency vs throughput.
> >>>>>
> >>>>> There are two things we probably want to consider, short-lived and
> >>>>> long-lived mappings, and in each of these situations, having enough
> >>>>> memory and running out of memory.
> >>>>>
> >>>>> For short-lived mappings, I believe reducing page faults is a bigger
> >>>>> factor in improving performance. In that case, khugepaged won't have
> >>>>> enough time to work, so THP=always will perform better than
> >>>>> THP=defer. THP=defer in this case will perform the same as
> >>>>> THP=madvise?
> >>>>> If there is enough memory, then the changes I introduced in the
> >>>>> shrinker won't cost anything, as the shrinker won't run, and the
> >>>>> system performance will be the same as THP=always.
> >>>>> If there is low memory and the shrinker runs, it will only split
> >>>>> THPs that have more zero-filled pages than max_ptes_none, and map
> >>>>> the zero-filled pages to the shared zero-page, saving memory.
> >>>>> There is of course a cost to splitting and running the shrinker, but
> >>>>> hopefully it only splits underused THPs.
> >>>>>
> >>>>> For long-lived mappings, reduced TLB misses would be the bigger
> >>>>> factor in improving performance.
> >>>>> For the initial run of the application, THP=always will perform
> >>>>> better wrt TLB misses, as the page fault handler will give THPs from
> >>>>> the start.
> >>>>> Later on in the run, the memory might look similar between
> >>>>> THP=always with shrinker and max_ptes_none < HPAGE_PMD_NR vs
> >>>>> THP=defer and max_ptes_none < HPAGE_PMD_NR?
> >>>>> This is because khugepaged will have collapsed pages that might
> >>>>> initially have been faulted in.
> >>>>> And collapsing has a cost, which would not have been incurred if the
> >>>>> THPs were present from fault.
> >>>>> If there is low memory, then the shrinker would split memory (which
> >>>>> has a cost as well) and the system memory would look similar to or
> >>>>> better than THP=defer, as the shrinker would split THPs that
> >>>>> initially might not have been underused, but are underused at the
> >>>>> time of memory pressure.
> >>>>>
> >>>>> With THP=always + underused shrinker, the cost (splitting) is
> >>>>> incurred only if and when it's needed.
> >>>>> While with THP=defer, the cost (higher page faults, higher TLB
> >>>>> misses + khugepaged collapse) is incurred all the time, even if the
> >>>>> system might have plenty of memory available and there is no need to
> >>>>> take a performance hit.
> >>>>
> >>>> I agree with this. The defer mode is an improvement over the upstream
> >>>> status quo, no doubt. However, both defer mode and the shrinker solve
> >>>> the issue of memory waste under pressure, while the shrinker permits
> >>>> more desirable behavior when memory is abundant.
> >>>>
> >>>> So my take is that the shrinker is the way to go, and I don't see a
> >>>> bona fide use case for defer mode that the shrinker couldn't cover.
> >>>
> >>> Page fault latency? IOW, zeroing a complete THP, which might be up to
> >>> 512 MiB on arm64. This is one of the things people bring up, where
> >>> FreeBSD is different because it will zero fragments on-demand (but
> >>> also result in more page faults).
> >>
> >> Good point, I suppose this could matter to some workloads. Would be
> >> interesting to get real-world data on that if that's the argument.
> >
> > IIRC, there is mention of some of these workloads in THP optimization
> > papers, where even the cost of defragmentation+reclaim+zeroing of 2MiB
> > apparently affects some latency-sensitive workloads.
> >
> > But the challenge that THP-deferred and the shrinker are trying to
> > solve right now is more towards memory over-allocation, and for that
> > the shrinker should *mostly* be able to do what the deferred option
> > would achieve.
> >
>
> I would say the underused shrinker is trying to solve memory
> over-allocation due to THP=always *while* trying to preserve the
> performance benefits of THP=always, which THP=defer might not achieve
> with the same performance due to more running costs?
>
> > There are these corner cases where a workload accesses some pages and
> > then long-term pins them (e.g., io_uring fixed buffers), where the
> > over-allocation cannot be reclaimed until the long-term pin is gone.
> >
> > But it's not been raised as a real problem in practice, so I'm just
> > mentioning it as a key difference: the shrinker runs after memory
> > over-allocation, with the cost of memory scanning and temporary memory
> > over-allocation+zeroing. "deferred" avoids memory over-allocation,
> > with the cost of more page faults + relying on khugepaged making
> > progress.
> >
>
> And the deferred option also has the cost of higher TLB misses (which
> can be quite significant) when the pages haven't been collapsed yet,
> and all of these are costs that shouldn't even be incurred if the
> system has plenty of memory.
>
> If there aren't many underused pages, then for long-term mappings, the
> memory would look similar in both cases after some time? As khugepaged
> would just collapse pages that would have been there from page fault
> with THP=always (especially if khugepaged is made more effective). In
> that case, the shrinker solution would have had no cost for that time
> (or maybe just page fault latency?), while THP=defer would incur all
> three costs?
>
> Both solutions do have their pros and cons, and I am not against having
> THP=defer. But I do feel the shrinker is a better solution, with the
> experiments in Meta production [1], and only having to incur a cost if
> and when needed. I could be biased as well :)
I agree, they both do have pros and cons, and I think the shrinker is
the better solution for the "always" approach.

It's important to remember that "defer" isn't necessarily trying to beat
"always". It's closer in behavior to "madvise", but with the added
benefit that applications that need to be "transparent" (no madvise
calls) can still leverage THPs.

I will try to find some latency benchmarks to show that "always" can
negatively impact latency in some cases, and that "madvise" (or "defer")
would be the better solution for that workload. In these cases the defer
solution would allow applications to transparently benefit from THPs and
reduce their TLB pressure over time, without incurring the latency
penalty associated with requesting a THP at fault.

>
> [1] https://lore.kernel.org/all/20240819023145.2415299-1-usamaarif642@gmail.com/#t
>
> >>
> >> Btw, I'm curious if anybody actually uses or has tried to use 512MB
> >> THPs. I understand it's technically possible, and there are use cases
> >> for 1G pages through static hugetlb management, but 512M seems wild
> >> for a "real" page size - real as in: allocated on-demand and actually
> >> paged to/from storage.
> >
> > I think it's less a matter of "using" or "trying", but more THPs being
> > default enabled in many setups nowadays, and if you have sufficient
> > memory in the system, even these weird 512MiB THPs will end up in your
> > address space if a (sparse) VMA is large enough :)
> >
>
> The page fault latency is a good point for ARM servers. I would imagine
> more ARM servers run with a 64K page size compared to 4K, and if the
> default config option for THP is not changed, then we could run into
> 512M THPs!
>

^ permalink raw reply	[flat|nested] 19+ messages in thread
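For concreteness, the knobs being debated here are plain sysfs files. A
minimal sketch of how an admin or init script might select the proposed
mode and the collapse threshold discussed earlier in the thread -- note
that the "defer" value only exists with this RFC applied; the paths
themselves are the stock THP locations:

#include <stdio.h>
#include <stdlib.h>

static void write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(EXIT_FAILURE);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* Defer fault-time THP allocation to khugepaged (RFC behavior). */
	write_knob("/sys/kernel/mm/transparent_hugepage/enabled", "defer");

	/*
	 * Allow at most 64 empty PTEs per collapse, i.e. require >= 448 of
	 * 512 subpages present -- the ~90% threshold mentioned above.
	 */
	write_knob("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none",
		   "64");
	return 0;
}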
* Re: [RFC 0/2] mm: introduce THP deferred setting
  2024-08-27 11:09 ` Johannes Weiner
  2024-08-27 11:46 ` David Hildenbrand
@ 2024-08-28  1:18 ` Rik van Riel
  2024-08-28  6:17 ` Kirill A . Shutemov
  1 sibling, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2024-08-28 1:18 UTC (permalink / raw)
  To: Johannes Weiner, Usama Arif
  Cc: Nico Pache, linux-mm, linux-kernel, linux-doc, Andrew Morton,
    David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts,
    Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli,
    Jonathan Corbet, Kirill A . Shutemov, Zi Yan

On Tue, 2024-08-27 at 13:09 +0200, Johannes Weiner wrote:
>
> I agree with this. The defer mode is an improvement over the upstream
> status quo, no doubt. However, both defer mode and the shrinker solve
> the issue of memory waste under pressure, while the shrinker permits
> more desirable behavior when memory is abundant.
>
> So my take is that the shrinker is the way to go, and I don't see a
> bona fide use case for defer mode that the shrinker couldn't cover.
>
>
I would like to take one step back and think about what some real-world
workloads might want as a tunable for THP.

Workload owners are going to have a real problem trying to figure out
what the best value of max_ptes_none should be for their workloads.

However, giving workload owners the ability to say "this workload
should not waste more than 1GB of memory on zero pages inside THPs", or
500MB, or 4GB, or whatever, would then allow the kernel to automatically
adjust the max_ptes_none threshold.

Once a workload is close to, or exceeding, the maximum amount of THP
zero-page overhead, we could both shrink THPs and disable direct THP
allocation at page fault time for that workload.

If we want to give workload owners a predictable, easy-to-work-with
tunable, we probably want both the shrinker and the deferred allocation.

--
All Rights Reversed.

^ permalink raw reply	[flat|nested] 19+ messages in thread
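Rik's byte-budget idea can be made concrete with a small sketch. None of
these symbols exist in the kernel; this only illustrates one plausible
translation from an admin-supplied waste budget to the existing
max_ptes_none threshold, under the simplifying assumption that the
allowed waste is spread evenly across the current anonymous THP
population:

/* Hypothetical helper -- not kernel code. */
#define HPAGE_PMD_NR	512		/* 4K subpages per 2M THP */
#define BASE_PAGE_SIZE	4096UL

static unsigned int max_ptes_none_for_budget(unsigned long budget_bytes,
					     unsigned long nr_anon_thps)
{
	unsigned long none_pages_per_thp;

	if (!nr_anon_thps)
		return HPAGE_PMD_NR - 1;	/* no THPs yet: stay permissive */

	/* Spread the allowed zero-page waste over the THP population. */
	none_pages_per_thp = budget_bytes / BASE_PAGE_SIZE / nr_anon_thps;

	if (none_pages_per_thp >= HPAGE_PMD_NR)
		return HPAGE_PMD_NR - 1;
	return (unsigned int)none_pages_per_thp;
}

Re-evaluated periodically, a feedback loop along these lines would
tighten max_ptes_none as the THP population grows -- roughly the gradual
approximation Rik describes further down the thread.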
* Re: [RFC 0/2] mm: introduce THP deferred setting
  2024-08-28  1:18 ` Rik van Riel
@ 2024-08-28  6:17 ` Kirill A . Shutemov
  2024-08-28 10:44 ` Usama Arif
  2024-08-28 12:54 ` Rik van Riel
  0 siblings, 2 replies; 19+ messages in thread
From: Kirill A . Shutemov @ 2024-08-28 6:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Usama Arif, Nico Pache, linux-mm, linux-kernel,
    linux-doc, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu,
    Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Zi Yan

On Tue, Aug 27, 2024 at 09:18:58PM -0400, Rik van Riel wrote:
> On Tue, 2024-08-27 at 13:09 +0200, Johannes Weiner wrote:
> >
> > I agree with this. The defer mode is an improvement over the upstream
> > status quo, no doubt. However, both defer mode and the shrinker solve
> > the issue of memory waste under pressure, while the shrinker permits
> > more desirable behavior when memory is abundant.
> >
> > So my take is that the shrinker is the way to go, and I don't see a
> > bona fide use case for defer mode that the shrinker couldn't cover.
> >
> >
> I would like to take one step back and think about what some real-world
> workloads might want as a tunable for THP.
>
> Workload owners are going to have a real problem trying to figure out
> what the best value of max_ptes_none should be for their workloads.
>
> However, giving workload owners the ability to say "this workload
> should not waste more than 1GB of memory on zero pages inside THPs",
> or 500MB, or 4GB, or whatever, would then allow the kernel to
> automatically adjust the max_ptes_none threshold.

The problem is that we don't have, and cannot have, the info on zero
pages inside THPs readily available. It requires memory scanning, which
is prohibitively expensive if we want the info to be somewhat
up-to-date.

We don't have enough input from HW on the access pattern. It would be
nice to decouple the A/D bits (or maybe just A) from the page table
structure and get higher resolution on the access pattern for THPs.

I tried to talk to HW folk, but it went nowhere. Maybe if there were
customer demand... Just saying...

--
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 19+ messages in thread
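A back-of-the-envelope sketch of why that scanning is prohibitive (all
numbers are assumptions for illustration, not measurements): establishing
which subpages are zero-filled means reading every candidate subpage, so
the refresh period scales with the amount of THP-backed memory divided by
the bandwidth one is willing to burn on scanning:

#include <stdio.h>

int main(void)
{
	double thp_backed_gib = 256.0;	/* assumed THP-backed memory, GiB */
	double scan_bw_gibs   = 10.0;	/* assumed spare bandwidth, GiB/s */

	/* ~26 seconds of sustained memory traffic per full refresh. */
	printf("one full refresh: ~%.0f s\n", thp_backed_gib / scan_bw_gibs);
	return 0;
}

Anything approaching "somewhat up-to-date" would mean paying that cost
continuously, which is the point about wanting hardware help instead.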
* Re: [RFC 0/2] mm: introduce THP deferred setting
  2024-08-28  6:17 ` Kirill A . Shutemov
@ 2024-08-28 10:44 ` Usama Arif
  0 siblings, 0 replies; 19+ messages in thread
From: Usama Arif @ 2024-08-28 10:44 UTC (permalink / raw)
  To: Kirill A . Shutemov, Rik van Riel, Nico Pache
  Cc: Johannes Weiner, linux-mm, linux-kernel, linux-doc, Andrew Morton,
    David Hildenbrand, Matthew Wilcox, Barry Song, Ryan Roberts,
    Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini, Andrea Arcangeli,
    Jonathan Corbet, Zi Yan

On 28/08/2024 02:17, Kirill A . Shutemov wrote:
> On Tue, Aug 27, 2024 at 09:18:58PM -0400, Rik van Riel wrote:
>> On Tue, 2024-08-27 at 13:09 +0200, Johannes Weiner wrote:
>>>
>>> I agree with this. The defer mode is an improvement over the upstream
>>> status quo, no doubt. However, both defer mode and the shrinker solve
>>> the issue of memory waste under pressure, while the shrinker permits
>>> more desirable behavior when memory is abundant.
>>>
>>> So my take is that the shrinker is the way to go, and I don't see a
>>> bona fide use case for defer mode that the shrinker couldn't cover.
>>>
>>>
>> I would like to take one step back and think about what some real-world
>> workloads might want as a tunable for THP.
>>
>> Workload owners are going to have a real problem trying to figure out
>> what the best value of max_ptes_none should be for their workloads.
>>

Yes, I agree. For both solutions, max_ptes_none needs to be adjusted,
and that would require experimentation with different values, which
workload owners might not do or want to do.

But as Kirill said, the information on the number of zero pages in THPs
isn't available. A possible solution might be randomly sampling a number
of THPs at certain time intervals, but I don't think it's a good idea to
use that as a representation of the entire system.

It's OK from my side to have both solutions in the kernel, as they don't
interfere with each other. THP=defer makes sense to have as well if
there are real-world workloads or benchmarks that show page fault
latency is a problem due to THP=always, as Nico mentioned in his
reply [1].

[1] https://lore.kernel.org/all/CAA1CXcCyRd+qfszM4GGvKqW=95AV9v8LG5oihByEBGLtW4tD4g@mail.gmail.com/

>> However, giving workload owners the ability to say "this workload
>> should not waste more than 1GB of memory on zero pages inside THPs",
>> or 500MB, or 4GB, or whatever, would then allow the kernel to
>> automatically adjust the max_ptes_none threshold.
>
> The problem is that we don't have, and cannot have, the info on zero
> pages inside THPs readily available. It requires memory scanning, which
> is prohibitively expensive if we want the info to be somewhat
> up-to-date.
>
> We don't have enough input from HW on the access pattern. It would be
> nice to decouple the A/D bits (or maybe just A) from the page table
> structure and get higher resolution on the access pattern for THPs.
>
> I tried to talk to HW folk, but it went nowhere. Maybe if there were
> customer demand... Just saying...
>

^ permalink raw reply	[flat|nested] 19+ messages in thread
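The sampling approach mentioned above can be sketched as follows --
purely illustrative, with all names hypothetical. It also makes the
representativeness problem visible: the final extrapolation step simply
assumes the sampled THPs behave like the whole population, which is
exactly the concern raised:

#include <stddef.h>
#include <string.h>

#define SUBPAGES	512
#define SUBPAGE_SZ	4096

static const char zero_subpage[SUBPAGE_SZ];

/* Count zero-filled 4K subpages in one sampled 2M THP's contents. */
static int count_zero_subpages(const char *thp)
{
	int zeros = 0;

	for (int i = 0; i < SUBPAGES; i++)
		if (!memcmp(thp + (size_t)i * SUBPAGE_SZ, zero_subpage,
			    SUBPAGE_SZ))
			zeros++;
	return zeros;
}

/* Extrapolate the sampled waste to the whole THP population. */
static unsigned long estimate_waste_bytes(unsigned long zero_in_sample,
					  unsigned long sampled_thps,
					  unsigned long total_thps)
{
	return zero_in_sample * SUBPAGE_SZ * total_thps / sampled_thps;
}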
* Re: [RFC 0/2] mm: introduce THP deferred setting
  2024-08-28  6:17 ` Kirill A . Shutemov
  2024-08-28 10:44 ` Usama Arif
@ 2024-08-28 12:54 ` Rik van Riel
  1 sibling, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2024-08-28 12:54 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Johannes Weiner, Usama Arif, Nico Pache, linux-mm, linux-kernel,
    linux-doc, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Barry Song, Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu,
    Rafael Aquini, Andrea Arcangeli, Jonathan Corbet, Zi Yan

On Wed, 2024-08-28 at 09:17 +0300, Kirill A . Shutemov wrote:
> On Tue, Aug 27, 2024 at 09:18:58PM -0400, Rik van Riel wrote:
> >
> > Workload owners are going to have a real problem trying to figure
> > out what the best value of max_ptes_none should be for their
> > workloads.
> >
> > However, giving workload owners the ability to say "this workload
> > should not waste more than 1GB of memory on zero pages inside THPs",
> > or 500MB, or 4GB, or whatever, would then allow the kernel to
> > automatically adjust the max_ptes_none threshold.
>
> The problem is that we don't have, and cannot have, the info on zero
> pages inside THPs readily available. It requires memory scanning,
> which is prohibitively expensive if we want the info to be somewhat
> up-to-date.
>
I'm not sure it needs to be super up-to-date. After all, we only care
when there is memory pressure, and when there is memory pressure we
will be doing some sort of scanning anyway.

With a shrinker in the mix, we do not need totally up-to-date
information, but can gradually approximate the target.

--
All Rights Reversed.

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-08-28 12:54 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-29 22:27 [RFC 0/2] mm: introduce THP deferred setting Nico Pache
2024-07-29 22:27 ` [RFC 1/2] mm: defer THP insertion to khugepaged Nico Pache
2024-07-29 22:27 ` [RFC 2/2] mm: document transparent_hugepage=defer usage Nico Pache
2024-07-30  1:26 ` [RFC 0/2] mm: introduce THP deferred setting Zi Yan
2024-07-30 22:37 ` Nico Pache
2024-08-26 15:40 ` Nico Pache
2024-08-26 16:47 ` Usama Arif
2024-08-26 21:14 ` Nico Pache
2024-08-27 10:37 ` Usama Arif
2024-08-27 11:09 ` Johannes Weiner
2024-08-27 11:46 ` David Hildenbrand
2024-08-27 13:05 ` Johannes Weiner
2024-08-27 13:22 ` David Hildenbrand
2024-08-27 13:57 ` Usama Arif
2024-08-27 22:04 ` Nico Pache
2024-08-28  1:18 ` Rik van Riel
2024-08-28  6:17 ` Kirill A . Shutemov
2024-08-28 10:44 ` Usama Arif
2024-08-28 12:54 ` Rik van Riel