[PATCH 0/8] Introduce a huge-page pre-zeroing mechanism

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: 李喆 <lizhe.67@bytedance.com>
To: <muchun.song@linux.dev>, <osalvador@suse.de>, <david@kernel.org>,
	 <akpm@linux-foundation.org>, <fvdl@google.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	 <lizhe.67@bytedance.com>
Subject: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Date: Thu, 25 Dec 2025 16:20:51 +0800	[thread overview]
Message-ID: <20251225082059.1632-1-lizhe.67@bytedance.com> (raw)

From: Li Zhe <lizhe.67@bytedance.com>

This patchset is based on this commit[1]("mm/hugetlb: optionally
pre-zero hugetlb pages").

Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40
milliseconds for a 1G page on a recent AMD-based system).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by these
pages) starts. For 256 1G pages and 40ms per page, this would take
10 seconds, a noticeable delay.

To accelerate the above scenario, this patchset exports a per-node,
read-write zeroable_hugepages interface for every hugepage size.
This interface reports how many hugepages on that node can currently
be pre-zeroed and allows user space to request that any integer number
in the range [0, max] be zeroed in a single operation.

This mechanism offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding the process to specific CPUs, users can confine zeroing
threads to cores that do not run latency-critical tasks, eliminating
interference.

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide.

On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
least 25628 us (figure inherited from the test results cited herein[1]).

In user space, we can use system calls such as epoll and write to zero
huge pages as they become available, and sleep when none are ready. The
following pseudocode illustrates this approach. The pseudocode spawns
eight threads that wait for huge pages on node 0 to become eligible for
zeroing; whenever such pages are available, the threads clear them in
parallel.

  static void thread_fun(void)
  {
  	epoll_create();
  	epoll_ctl();
  	while (1) {
  		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		if (val > 0)
  			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		epoll_wait();
  	}
  }

  static void start_pre_zero_thread(int thread_num)
  {
  	create_pre_zero_threads(thread_num, thread_fun)
  }

  int main(void)
  {
  	start_pre_zero_thread(8);
  }

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Li Zhe (8):
  mm/hugetlb: add pre-zeroed framework
  mm/hugetlb: convert to prep_account_new_hugetlb_folio()
  mm/hugetlb: move the huge folio to the end of the list during enqueue
  mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
  mm/hugetlb: relocate the per-hstate struct kobject pointer
  mm/hugetlb: add epoll support for interface "zeroable_hugepages"
  mm/hugetlb: limit event generation frequency of function
    do_zero_free_notify()

 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 ++++++
 mm/hugetlb.c            | 133 +++++++++++++++++++++++---
 mm/hugetlb_internal.h   |   6 ++
 mm/hugetlb_sysfs.c      | 202 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 335 insertions(+), 35 deletions(-)

-- 
2.20.1

next             reply	other threads:[~2025-12-25  8:21 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-25  8:20 李喆 [this message]
2025-12-25  8:20 ` [PATCH 1/8] mm/hugetlb: add pre-zeroed framework 李喆
2025-12-26  9:24   ` Raghavendra K T
2025-12-26  9:48     ` Li Zhe
2025-12-25  8:20 ` [PATCH 2/8] mm/hugetlb: convert to prep_account_new_hugetlb_folio() 李喆
2025-12-25  8:20 ` [PATCH 3/8] mm/hugetlb: move the huge folio to the end of the list during enqueue 李喆
2025-12-25  8:20 ` [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages" 李喆
2025-12-26 18:51   ` Frank van der Linden
2025-12-29 12:25     ` Li Zhe
2025-12-29 18:57       ` Frank van der Linden
2025-12-30  2:41         ` Li Zhe
2025-12-25  8:20 ` [PATCH 5/8] mm/hugetlb: simplify function hugetlb_sysfs_add_hstate() 李喆
2025-12-25  8:20 ` [PATCH 6/8] mm/hugetlb: relocate the per-hstate struct kobject pointer 李喆
2025-12-25  8:20 ` [PATCH 7/8] mm/hugetlb: add epoll support for interface "zeroable_hugepages" 李喆
2025-12-25  8:20 ` [PATCH 8/8] mm/hugetlb: limit event generation frequency of function do_zero_free_notify() 李喆
2025-12-26 18:32 ` [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism Frank van der Linden
2025-12-26 21:42   ` Frank van der Linden
2025-12-29 12:28     ` Li Zhe
2025-12-27  7:21 ` Mateusz Guzik
2025-12-29 12:31   ` Li Zhe
2025-12-28 21:44 ` Andrew Morton
2025-12-29 12:34   ` Li Zhe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251225082059.1632-1-lizhe.67@bytedance.com \
    --to=lizhe.67@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=fvdl@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox