From: David Hildenbrand <david@redhat.com>
To: Usama Arif <usamaarif642@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org
Cc: hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com,
ziy@nvidia.com, baolin.wang@linux.alibaba.com,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
hughd@google.com, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, kernel-team@meta.com
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
Date: Fri, 6 Jun 2025 19:37:39 +0200
Message-ID: <4c1d5033-0c90-4672-84a1-15978ced245d@redhat.com>
In-Reply-To: <20250606143700.3256414-1-usamaarif642@gmail.com>

On 06.06.25 16:37, Usama Arif wrote:
> On arm64 machines with 64K PAGE_SIZE, min_free_kbytes and hence the
> watermarks evaluate to extremely high values. For example, on a server
> with 480G of memory, with only the 2M mTHP hugepage size set to madvise
> and the rest of the sizes set to never, the min, low and high watermarks
> evaluate to 11.2G, 14G and 16.8G respectively.
> In contrast, with 4K PAGE_SIZE on the same machine, with only the 2M THP
> hugepage size set to madvise, the min, low and high watermarks evaluate
> to 86M, 566M and 1G respectively.
> This is because set_recommended_min_free_kbytes is designed for PMD
> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> Such high watermark values can cause performance and latency issues in
> memory-bound applications on arm servers that use 64K PAGE_SIZE, even
> though most of them would never actually use a 512M PMD THP.
>
> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
> large folio order enabled in set_recommended_min_free_kbytes.
> With this patch, when only 2M THP hugepage size is set to madvise for the
> same machine with 64K page size, with the rest of the sizes set to never,
> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> respectively. When 512M THP hugepage size is set to madvise for the same
> machine with 64K page size, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively, the same as without this patch.
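
For readers following the numbers: the gap between the two 64K results tracks the block size the calculation is based on. A 512M block spans 8192 base pages of 64K (order 13), while a 2M block spans only 32 (order 5). Below is a small userspace sketch of that arithmetic. It is not kernel code, and the helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration only: order of a huge page, i.e. log2(huge_size / page_size).
 * Both sizes are assumed to be powers of two.
 */
static unsigned int order_for_size(uint64_t huge_size, uint64_t page_size)
{
	unsigned int order = 0;

	assert(page_size != 0 && huge_size >= page_size);
	while ((page_size << order) < huge_size)
		order++;
	return order;
}

/* Number of base pages covered by one block of the given order. */
static uint64_t nr_pages_for_order(unsigned int order)
{
	return UINT64_C(1) << order;
}
```

Under these assumptions, order_for_size(2 MiB, 4 KiB) is 9, order_for_size(2 MiB, 64 KiB) is 5 and order_for_size(512 MiB, 64 KiB) is 13.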
>
> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
> is not dynamic with hugepage size, would need different kernel builds for
> different hugepage sizes, and most users won't know that this needs to be
> done, as it can be difficult to determine that the performance and latency
> issues are coming from the high watermark values.
>
> All watermark numbers are for zones of nodes that had the highest number
> of pages, i.e. the value for min size for 4K is obtained using:
> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> and for 64K using:
> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
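
The same measurement can be sketched in C. This is a rough userspace equivalent of the pipelines above, under two assumptions: the stream has the /proc/zoneinfo layout, and only lines that begin with "min" (after leading whitespace) are counted, which is slightly stricter than `grep -i min`. The function names are hypothetical:

```c
#include <stdio.h>

/* Largest "min" watermark, in pages, across all zones in a zoneinfo-style stream. */
static unsigned long max_min_watermark_pages(FILE *f)
{
	char line[256];
	unsigned long val, max = 0;

	while (fgets(line, sizeof(line), f)) {
		/* Matches lines such as "        min      12345". */
		if (sscanf(line, " min %lu", &val) == 1 && val > max)
			max = val;
	}
	return max;
}

/* Convert a page count to MiB for a given base page size in bytes. */
static double pages_to_mib(unsigned long pages, unsigned long page_size)
{
	return (double)pages * (double)page_size / 1024.0 / 1024.0;
}
```

A caller would open /proc/zoneinfo with fopen() and pass 4096 or 65536 as page_size, matching the two awk expressions.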
>
> An arbitrary minimum of 128 pages is used when no hugepage sizes are
> enabled.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
> mm/khugepaged.c | 32 ++++++++++++++++++++++++++++----
> mm/shmem.c | 29 +++++------------------------
> 3 files changed, 58 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..fb4e51ef0acb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> }
> #endif
>
> +/*
> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> + *
> + * SHMEM_HUGE_NEVER:
> + * disables huge pages for the mount;
> + * SHMEM_HUGE_ALWAYS:
> + * enables huge pages for the mount;
> + * SHMEM_HUGE_WITHIN_SIZE:
> + * only allocate huge pages if the page will be fully within i_size,
> + * also respect madvise() hints;
> + * SHMEM_HUGE_ADVISE:
> + * only allocate huge pages if requested with madvise();
> + */
> +
> + #define SHMEM_HUGE_NEVER 0
> + #define SHMEM_HUGE_ALWAYS 1
> + #define SHMEM_HUGE_WITHIN_SIZE 2
> + #define SHMEM_HUGE_ADVISE 3
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> extern unsigned long transparent_hugepage_flags;
> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
> extern unsigned long huge_anon_orders_madvise;
> extern unsigned long huge_anon_orders_inherit;
>
> +extern int shmem_huge __read_mostly;
> +extern unsigned long huge_shmem_orders_always;
> +extern unsigned long huge_shmem_orders_madvise;
> +extern unsigned long huge_shmem_orders_inherit;
> +extern unsigned long huge_shmem_orders_within_size;
Do all of these really have to be exported?
> +
> static inline bool hugepage_global_enabled(void)
> {
> return transparent_hugepage_flags &
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 15203ea7d007..e64cba74eb2a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
> return 0;
> }
>
> +static int thp_highest_allowable_order(void)
Did you mean "largest"?
> +{
> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
> +			       | READ_ONCE(huge_anon_orders_madvise)
> +			       | READ_ONCE(huge_shmem_orders_always)
> +			       | READ_ONCE(huge_shmem_orders_madvise)
> +			       | READ_ONCE(huge_shmem_orders_within_size);
> +	if (hugepage_global_enabled())
> +		orders |= READ_ONCE(huge_anon_orders_inherit);
> +	if (shmem_huge != SHMEM_HUGE_NEVER)
> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
> +
> +	return orders == 0 ? 0 : fls(orders) - 1;
> +}
But how does this interact with large folios / THPs in the page cache?
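
For reference, the selection logic in the hunk above reduces to "take the most significant set bit of the combined order bitmask". Here is a minimal userspace sketch of that step. It assumes GCC or Clang for __builtin_clzl, and fls_ul() is a stand-in written for this example, not the kernel's fls():

```c
#include <limits.h>

/*
 * Stand-in for a "find last set" helper: 1-based index of the most
 * significant set bit, or 0 if no bit is set.
 */
static int fls_ul(unsigned long x)
{
	if (!x)
		return 0;
	return (int)(sizeof(x) * CHAR_BIT) - __builtin_clzl(x);
}

/* Bit N set in 'orders' means order N is enabled; return the largest such N. */
static int largest_enabled_order(unsigned long orders)
{
	return orders == 0 ? 0 : fls_ul(orders) - 1;
}
```

With only order 5 set the result is 5; with orders 5 and 13 set it is 13; with nothing set it is 0, which is indistinguishable from "only order 0 enabled".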
> +
> +static unsigned long min_thp_pageblock_nr_pages(void)
Reading the function name, I have no idea what this function is supposed
to do.
--
Cheers,
David / dhildenb