From: Muchun Song <muchun.song@linux.dev>
To: Li Zhe <lizhe.67@bytedance.com>
Cc: osalvador@suse.de, david@kernel.org, akpm@linux-foundation.org,
fvdl@google.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Date: Fri, 9 Jan 2026 14:05:01 +0800 [thread overview]
Message-ID: <1981A332-0585-49AB-9ADE-99FA2FB32DD4@linux.dev> (raw)
In-Reply-To: <20260107113130.37231-1-lizhe.67@bytedance.com>
> On Jan 7, 2026, at 19:31, Li Zhe <lizhe.67@bytedance.com> wrote:
>
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
I’d like you to add a brief summary here explaining what concerns
were raised about the previous attempts and whether the current
proposal addresses them, so that more people can quickly grasp the
context.
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 250
> milliseconds for a 1G page on a Skylake machine).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
>
> On a Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur’s optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.
I did see some comments in [1] about QEMU supporting user-mode
parallel zero-page operations; I’m just not sure what the current
state of that support looks like, or what the corresponding benchmark
numbers are.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
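As a quick sketch of the intended usage (the sysfs path assumes 1 GB
huge pages on node 0; the fallback to a scratch file is only there so
the commands can be demonstrated on an unpatched kernel):

```shell
# Proposed interface: reading reports how many huge pages on the node can
# still be pre-zeroed; writing N (or "max") asks the kernel to zero that many.
ZP=/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
# Stand-in file so the commands below also run on an unpatched kernel.
[ -w "$ZP" ] || { ZP=$(mktemp); echo 64 > "$ZP"; }
cat "$ZP"          # number of huge pages currently zeroable, e.g. 64
echo 4 > "$ZP"     # pre-zero up to 4 of them
echo max > "$ZP"   # pre-zero every currently zeroable page
```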
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
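Points (3) and (5) above need no new tooling; for illustration, the
worker below is a stand-in command (a real one would write "max" to
zeroable_hugepages), and the pinning is skipped if taskset from
util-linux is unavailable:

```shell
# Pin a stand-in zeroing worker to CPU 0 so it stays off latency-critical
# cores; a real deployment could additionally place it in a cgroup with a
# cpu.max limit to bound its CPU consumption.
CMD='echo "zeroing worker done"'
if command -v taskset >/dev/null 2>&1; then
        taskset -c 0 sh -c "$CMD"
else
        sh -c "$CMD"
fi
```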
>
> On the same Skylake platform as above, with the 64 GiB of memory
> pre-zeroed in advance by this mechanism, the fault-in latency test
> completed in negligible time.
>
> In user space, epoll can be used to sleep until huge folios become
> zeroable, and write() to trigger the zeroing. The following sketch
> illustrates the approach: it spawns eight threads (each running
> thread_fun()) that wait for huge pages on node 0 to become eligible
> for zeroing and, whenever such pages are available, clear them in
> parallel.
>
> #include <fcntl.h>
> #include <pthread.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/epoll.h>
>
> #define ZP_PATH "/sys/devices/system/node/node0/hugepages/" \
>                 "hugepages-1048576kB/zeroable_hugepages"
>
> static void *thread_fun(void *arg)
> {
>         struct epoll_event ev = { .events = EPOLLPRI | EPOLLERR };
>         int fd = open(ZP_PATH, O_RDWR);
>         int epfd = epoll_create1(0);
>
>         epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
>         while (1) {
>                 char buf[32] = { 0 };
>
>                 /* sysfs attributes must be re-read from offset 0 */
>                 pread(fd, buf, sizeof(buf) - 1, 0);
>                 if (atoi(buf) > 0)
>                         write(fd, "max", 3);
>                 /* sleep until the kernel signals newly zeroable pages */
>                 epoll_wait(epfd, &ev, 1, -1);
>         }
>         return NULL;
> }
>
> int main(void)
> {
>         pthread_t tid;
>         int i;
>
>         for (i = 0; i < 8; i++)
>                 pthread_create(&tid, NULL, thread_fun, NULL);
>         pause();
> }
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
> [2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u
>
> Li Zhe (8):
> mm/hugetlb: add pre-zeroed framework
> mm/hugetlb: convert to prep_account_new_hugetlb_folio()
> mm/hugetlb: move the huge folio to the end of the list during enqueue
> mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
> mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
> mm/hugetlb: relocate the per-hstate struct kobject pointer
> mm/hugetlb: add epoll support for interface "zeroable_hugepages"
> mm/hugetlb: limit event generation frequency of function
> do_zero_free_notify()
>
> fs/hugetlbfs/inode.c | 3 +-
> include/linux/hugetlb.h | 26 +++++
> mm/hugetlb.c | 131 ++++++++++++++++++++++---
> mm/hugetlb_internal.h | 6 ++
> mm/hugetlb_sysfs.c | 206 ++++++++++++++++++++++++++++++++++++----
> 5 files changed, 337 insertions(+), 35 deletions(-)
>
> ---
> Changelogs:
>
> v1->v2 :
> - Use guard() to simplify function hpage_wait_zeroing(). (pointed out
>   by Raghu)
> - Simplify the logic of zero_free_hugepages_nid() by removing
>   redundant checks and exiting the loop upon encountering a
>   pre-zeroed folio. (pointed out by Frank)
> - Include in the cover letter a performance comparison with Ankur's
>   optimization patch[2]. (pointed out by Andrew)
>
> v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/
>
> --
> 2.20.1