From: Ankur Arora <ankur.a.arora@oracle.com>
To: Li Zhe <lizhe.67@bytedance.com>
Cc: muchun.song@linux.dev, osalvador@suse.de, david@kernel.org,
akpm@linux-foundation.org, fvdl@google.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Date: Mon, 12 Jan 2026 14:01:29 -0800 [thread overview]
Message-ID: <87jyxmjxs6.fsf@oracle.com> (raw)
In-Reply-To: <20260107113130.37231-1-lizhe.67@bytedance.com>
Li Zhe <lizhe.67@bytedance.com> writes:
> This patchset is based on the earlier patch[1] ("mm/hugetlb:
> optionally pre-zero hugetlb pages").
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 250
> milliseconds for a 1G page on a Skylake machine).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
>
> On a Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur’s optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed, and allows user space to request, in a single
> operation, that any number of them in the range [0, max] be zeroed.
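>
> For example, using the node-0, 1 GiB attribute that also appears in
> the pseudocode below (writing "max" zeroes every currently eligible
> page):
>
>     # how many pages on node 0 can currently be pre-zeroed
>     cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
>     # zero eight of them
>     echo 8 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
>     # zero all of them
>     echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages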
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
>
> On the same Skylake platform as above, with the 64 GiB of memory
> pre-zeroed in advance by this mechanism, the fault-in latency test
> completed in negligible time.
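>
> As an illustration of (3) and (5) above, a zeroing helper (called
> "prezero" here purely for illustration) could be pinned and
> throttled along these lines:
>
>     taskset -c 60-63 ./prezero &    # keep it off latency-critical cores
>     mkdir /sys/fs/cgroup/prezero
>     echo "200000 100000" > /sys/fs/cgroup/prezero/cpu.max    # cap at 2 CPUs
>     echo $! > /sys/fs/cgroup/prezero/cgroup.procs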
>
> In user space, we can use epoll and write to zero huge folios as
> they become available, and sleep when none are ready. The following
> pseudocode illustrates the approach: it spawns eight threads (each
> running thread_fun()) that wait for huge pages on node 0 to become
> eligible for zeroing; whenever such pages are available, the
> threads clear them in parallel.
>
> static void thread_fun(void)
> {
>     epoll_create();
>     epoll_ctl();
>     while (1) {
>         val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>         if (val > 0)
>             system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>         epoll_wait();
>     }
> }
Given that zeroable_hugepages is per node, anybody who writes to
it would need to know what the aggregate demand would be.
Seems to me that the only value that might make sense is "max".
And at that point this approach seems a little bit like init_on_free.
Ankur
> static void start_pre_zero_thread(int thread_num)
> {
>     create_pre_zero_threads(thread_num, thread_fun);
> }
>
> int main(void)
> {
>     start_pre_zero_thread(8);
> }
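>
> A more fleshed-out sketch of the same loop in C (a sketch only: it
> assumes the epoll support added in patch 7 follows the usual sysfs
> poll semantics, i.e. sysfs_notify() wakes waiters with
> EPOLLPRI|EPOLLERR and the attribute must be re-read from offset 0;
> the thread body corresponds to thread_fun() above):
>
> #include <fcntl.h>
> #include <pthread.h>
> #include <stdlib.h>
> #include <sys/epoll.h>
> #include <unistd.h>
>
> #define ATTR "/sys/devices/system/node/node0/hugepages/" \
>              "hugepages-1048576kB/zeroable_hugepages"
>
> static void *thread_fun(void *arg)
> {
>     struct epoll_event ev = { .events = EPOLLPRI | EPOLLERR };
>     int efd = epoll_create1(0);
>     int fd = open(ATTR, O_RDWR);
>     char buf[32];
>     ssize_t n;
>
>     (void)arg;
>     if (efd < 0 || fd < 0)
>         return NULL;
>     ev.data.fd = fd;
>     epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);
>
>     while (1) {
>         /* sysfs attributes must be re-read from offset 0 */
>         lseek(fd, 0, SEEK_SET);
>         n = read(fd, buf, sizeof(buf) - 1);
>         if (n > 0) {
>             buf[n] = '\0';
>             if (atol(buf) > 0) {
>                 /* zero every currently eligible page */
>                 lseek(fd, 0, SEEK_SET);
>                 write(fd, "max", 3);
>             }
>         }
>         /* sleep until the kernel signals more zeroable pages */
>         epoll_wait(efd, &ev, 1, -1);
>     }
>     return NULL;
> }
>
> int main(void)
> {
>     pthread_t tid[8];
>     int i;
>
>     for (i = 0; i < 8; i++)
>         pthread_create(&tid[i], NULL, thread_fun, NULL);
>     for (i = 0; i < 8; i++)
>         pthread_join(tid[i], NULL);
>     return 0;
> }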
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
> [2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u
>
> Li Zhe (8):
> mm/hugetlb: add pre-zeroed framework
> mm/hugetlb: convert to prep_account_new_hugetlb_folio()
> mm/hugetlb: move the huge folio to the end of the list during enqueue
> mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
> mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
> mm/hugetlb: relocate the per-hstate struct kobject pointer
> mm/hugetlb: add epoll support for interface "zeroable_hugepages"
> mm/hugetlb: limit event generation frequency of function
> do_zero_free_notify()
>
> fs/hugetlbfs/inode.c | 3 +-
> include/linux/hugetlb.h | 26 +++++
> mm/hugetlb.c | 131 ++++++++++++++++++++++---
> mm/hugetlb_internal.h | 6 ++
> mm/hugetlb_sysfs.c | 206 ++++++++++++++++++++++++++++++++++++----
> 5 files changed, 337 insertions(+), 35 deletions(-)
>
> ---
> Changelog:
>
> v1 -> v2:
> - Use guard() to simplify function hpage_wait_zeroing(). (pointed
>   out by Raghu)
> - Simplify the logic of zero_free_hugepages_nid() by removing
>   redundant checks and exiting the loop upon encountering a
>   pre-zeroed folio. (pointed out by Frank)
> - Include in the cover letter a performance comparison with Ankur's
>   optimization patch[2]. (pointed out by Andrew)
>
> v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/
--
ankur