linux-mm.kvack.org archive mirror
* [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
@ 2025-12-25  8:20 李喆
  2025-12-25  8:20 ` [PATCH 1/8] mm/hugetlb: add pre-zeroed framework 李喆
                   ` (10 more replies)
  0 siblings, 11 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

This patchset is based on the earlier commit[1] ("mm/hugetlb: optionally
pre-zero hugetlb pages").

Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40
milliseconds for a 1G page on a recent AMD-based system).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by these
pages) starts. For 256 1G pages and 40ms per page, this would take
10 seconds, a noticeable delay.

To accelerate the above scenario, this patchset exports a per-node,
read-write zeroable_hugepages sysfs interface for every hugepage size.
Reading it reports how many huge pages on that node are currently
eligible for pre-zeroing; writing an integer count (or the string "max")
requests that up to that many huge pages be zeroed in a single
operation.

This mechanism offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding the process to specific CPUs, users can confine zeroing
threads to cores that do not run latency-critical tasks, eliminating
interference.

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide.

On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
least 25628 us (figure taken from the test results cited in [1]).

In user space, epoll and write can be combined to zero huge pages as
they become available and to sleep when none are ready. The following
pseudocode illustrates the approach: it spawns eight threads that wait
for huge pages on node 0 to become eligible for zeroing and, whenever
such pages are available, clear them in parallel.

  static void thread_fun(void)
  {
  	epoll_create();
  	epoll_ctl();
  	while (1) {
  		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		if (val > 0)
  			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		epoll_wait();
  	}
  }
  
  static void start_pre_zero_thread(int thread_num)
  {
  	create_pre_zero_threads(thread_num, thread_fun);
  }
  
  int main(void)
  {
  	start_pre_zero_thread(8);
  }
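
A more concrete, self-contained sketch of the same loop is shown below.
It is an illustration only, not part of this series; it assumes the
usual sysfs poll semantics (sysfs_notify() is reported to pollers as
EPOLLPRI/EPOLLERR, and the file must be re-read from offset 0 to re-arm
the notification) and must be built with -pthread.

  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <sys/epoll.h>
  #include <unistd.h>

  #define ZPATH "/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"

  static void *zero_thread(void *arg)
  {
  	struct epoll_event ev = { .events = EPOLLPRI | EPOLLERR };
  	char buf[32];
  	ssize_t n;
  	int fd = open(ZPATH, O_RDWR);
  	int epfd = epoll_create1(0);

  	if (fd < 0 || epfd < 0)
  		return NULL;
  	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

  	for (;;) {
  		/* Read the current number of zeroable pages (also re-arms the poll). */
  		n = pread(fd, buf, sizeof(buf) - 1, 0);
  		if (n > 0) {
  			buf[n] = '\0';
  			if (atoi(buf) > 0)
  				/* Ask the kernel to pre-zero all eligible pages. */
  				pwrite(fd, "max", 3, 0);
  		}
  		/* Sleep until the kernel signals that more pages are zeroable. */
  		epoll_wait(epfd, &ev, 1, -1);
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tid[8];
  	int i;

  	for (i = 0; i < 8; i++)
  		pthread_create(&tid[i], NULL, zero_thread, NULL);
  	for (i = 0; i < 8; i++)
  		pthread_join(tid[i], NULL);
  	return 0;
  }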

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Li Zhe (8):
  mm/hugetlb: add pre-zeroed framework
  mm/hugetlb: convert to prep_account_new_hugetlb_folio()
  mm/hugetlb: move the huge folio to the end of the list during enqueue
  mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
  mm/hugetlb: relocate the per-hstate struct kobject pointer
  mm/hugetlb: add epoll support for interface "zeroable_hugepages"
  mm/hugetlb: limit event generation frequency of function
    do_zero_free_notify()

 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 ++++++
 mm/hugetlb.c            | 133 +++++++++++++++++++++++---
 mm/hugetlb_internal.h   |   6 ++
 mm/hugetlb_sysfs.c      | 202 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 335 insertions(+), 35 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/8] mm/hugetlb: add pre-zeroed framework
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-26  9:24   ` Raghavendra K T
  2025-12-25  8:20 ` [PATCH 2/8] mm/hugetlb: convert to prep_account_new_hugetlb_folio() 李喆
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

This patch establishes a pre-zeroing framework by introducing two new
hugetlb page flags and extends the code at every point where these flags
may later be required. The roles of the two flags are as follows.

(1) HPG_zeroed – indicates that the huge folio has already been
    zeroed
(2) HPG_zeroing – marks that the huge folio is currently being zeroed

No functional change, as nothing sets the flags yet.

Co-developed-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 +++++++++
 mm/hugetlb.c            | 113 +++++++++++++++++++++++++++++++++++++---
 3 files changed, 133 insertions(+), 9 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3b4c152c5c73..be6b32ab3ca8 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -828,8 +828,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(folio);
 			goto out;
 		}
-		folio_zero_user(folio, addr);
-		__folio_mark_uptodate(folio);
+		hugetlb_zero_folio(folio, addr);
 		error = hugetlb_add_to_page_cache(folio, mapping, index);
 		if (unlikely(error)) {
 			restore_reserve_on_error(h, &pseudo_vma, addr, folio);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 019a1c5281e4..2daf4422a17d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -584,6 +584,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
  * HPG_raw_hwp_unreliable - Set when the hugetlb page has a hwpoison sub-page
  *     that is not tracked by raw_hwp_page list.
+ * HPG_zeroed - page was pre-zeroed.
+ *	Synchronization: hugetlb_lock held when set by pre-zero thread.
+ *	Only valid to read outside hugetlb_lock once the page is off
+ *	the freelist, and HPG_zeroing is clear. Always cleared when a
+ *	page is put (back) on the freelist.
+ * HPG_zeroing - page is being zeroed by the pre-zero thread.
+ *	Synchronization: set and cleared by the pre-zero thread with
+ *	hugetlb_lock held. Access by others is read-only. Once the page
+ *	is off the freelist, this can only change from set -> clear,
+ *	which the new page owner must wait for. Always cleared
+ *	when a page is put (back) on the freelist.
  */
 enum hugetlb_page_flags {
 	HPG_restore_reserve = 0,
@@ -593,6 +604,8 @@ enum hugetlb_page_flags {
 	HPG_vmemmap_optimized,
 	HPG_raw_hwp_unreliable,
 	HPG_cma,
+	HPG_zeroed,
+	HPG_zeroing,
 	__NR_HPAGEFLAGS,
 };
 
@@ -653,6 +666,8 @@ HPAGEFLAG(Freed, freed)
 HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
 HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
 HPAGEFLAG(Cma, cma)
+HPAGEFLAG(Zeroed, zeroed)
+HPAGEFLAG(Zeroing, zeroing)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
@@ -678,6 +693,12 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+
+	unsigned int free_huge_pages_zero_node[MAX_NUMNODES];
+
+	/* Queue to wait for a hugetlb folio that is being prezeroed */
+	wait_queue_head_t dqzero_wait[MAX_NUMNODES];
+
 	char name[HSTATE_NAME_LEN];
 };
 
@@ -711,6 +732,7 @@ int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct folio *folio);
+void hugetlb_zero_folio(struct folio *folio, unsigned long address);
 
 /* arch callback */
 int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
@@ -1303,6 +1325,10 @@ static inline bool hugetlb_bootmem_allocated(void)
 {
 	return false;
 }
+
+static inline void hugetlb_zero_folio(struct folio *folio, unsigned long address)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51273baec9e5..d20614b1c927 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,6 +93,8 @@ static int hugetlb_param_index __initdata;
 static __init int hugetlb_add_param(char *s, int (*setup)(char *val));
 static __init void hugetlb_parse_params(void);
 
+static void hpage_wait_zeroing(struct hstate *h, struct folio *folio);
+
 #define hugetlb_early_param(str, func) \
 static __init int func##args(char *s) \
 { \
@@ -1292,21 +1294,33 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
 	hugetlb_dup_vma_private(vma);
 }
 
+/*
+ * Clear flags for either a fresh page or one that is being
+ * added to the free list.
+ */
+static inline void prep_clear_zeroed(struct folio *folio)
+{
+	folio_clear_hugetlb_zeroed(folio);
+	folio_clear_hugetlb_zeroing(folio);
+}
+
 static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	int nid = folio_nid(folio);
 
 	lockdep_assert_held(&hugetlb_lock);
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
+	VM_WARN_ON_FOLIO(folio_test_hugetlb_zeroing(folio), folio);
 
 	list_move(&folio->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
+	prep_clear_zeroed(folio);
 	folio_set_hugetlb_freed(folio);
 }
 
-static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
-								int nid)
+static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, int nid,
+		gfp_t gfp_mask)
 {
 	struct folio *folio;
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
@@ -1316,6 +1330,16 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
 		if (pin && !folio_is_longterm_pinnable(folio))
 			continue;
 
+		/*
+		 * This shouldn't happen, as hugetlb pages are never allocated
+		 * with GFP_ATOMIC. But be paranoid and check for it, as
+		 * a zero_busy page might cause a sleep later in
+		 * hpage_wait_zeroing().
+		 */
+		if (WARN_ON_ONCE(folio_test_hugetlb_zeroing(folio) &&
+					!gfpflags_allow_blocking(gfp_mask)))
+			continue;
+
 		if (folio_test_hwpoison(folio))
 			continue;
 
@@ -1327,6 +1351,10 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
 		folio_clear_hugetlb_freed(folio);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
+		if (folio_test_hugetlb_zeroed(folio) ||
+		    folio_test_hugetlb_zeroing(folio))
+			h->free_huge_pages_zero_node[nid]--;
+
 		return folio;
 	}
 
@@ -1363,7 +1391,7 @@ static struct folio *dequeue_hugetlb_folio_nodemask(struct hstate *h, gfp_t gfp_
 			continue;
 		node = zone_to_nid(zone);
 
-		folio = dequeue_hugetlb_folio_node_exact(h, node);
+		folio = dequeue_hugetlb_folio_node_exact(h, node, gfp_mask);
 		if (folio)
 			return folio;
 	}
@@ -1490,7 +1518,16 @@ void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
 		folio_clear_hugetlb_freed(folio);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
+		folio_clear_hugetlb_freed(folio);
 	}
+	/*
+	 * Adjust the zero page counters now. Note that
+	 * if a page is currently being zeroed, that
+	 * will be waited for in update_and_free_page()
+	 */
+	if (folio_test_hugetlb_zeroed(folio) ||
+	    folio_test_hugetlb_zeroing(folio))
+		h->free_huge_pages_zero_node[nid]--;
 	if (adjust_surplus) {
 		h->surplus_huge_pages--;
 		h->surplus_huge_pages_node[nid]--;
@@ -1543,6 +1580,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 {
 	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
 
+	VM_WARN_ON_FOLIO(folio_test_hugetlb_zeroing(folio), folio);
+
 	if (hstate_is_gigantic_no_runtime(h))
 		return;
 
@@ -1627,6 +1666,7 @@ static void free_hpage_workfn(struct work_struct *work)
 		 */
 		h = size_to_hstate(folio_size(folio));
 
+		hpage_wait_zeroing(h, folio);
 		__update_and_free_hugetlb_folio(h, folio);
 
 		cond_resched();
@@ -1643,7 +1683,8 @@ static inline void flush_free_hpage_work(struct hstate *h)
 static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
 				 bool atomic)
 {
-	if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) {
+	if ((!folio_test_hugetlb_zeroing(folio) &&
+	     !folio_test_hugetlb_vmemmap_optimized(folio)) || !atomic) {
 		__update_and_free_hugetlb_folio(h, folio);
 		return;
 	}
@@ -1840,6 +1881,13 @@ static void account_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 	h->nr_huge_pages_node[folio_nid(folio)]++;
 }
 
+static void prep_new_hugetlb_folio(struct folio *folio)
+{
+	lockdep_assert_held(&hugetlb_lock);
+	folio_clear_hugetlb_freed(folio);
+	prep_clear_zeroed(folio);
+}
+
 void init_new_hugetlb_folio(struct folio *folio)
 {
 	__folio_set_hugetlb(folio);
@@ -1964,6 +2012,7 @@ void prep_and_add_allocated_folios(struct hstate *h,
 	/* Add all new pool pages to free lists in one lock cycle */
 	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+		prep_new_hugetlb_folio(folio);
 		account_new_hugetlb_folio(h, folio);
 		enqueue_hugetlb_folio(h, folio);
 	}
@@ -2171,6 +2220,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 		return NULL;
 
 	spin_lock_irq(&hugetlb_lock);
+	prep_new_hugetlb_folio(folio);
 	/*
 	 * nr_huge_pages needs to be adjusted within the same lock cycle
 	 * as surplus_pages, otherwise it might confuse
@@ -2214,6 +2264,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
 		return NULL;
 
 	spin_lock_irq(&hugetlb_lock);
+	prep_new_hugetlb_folio(folio);
 	account_new_hugetlb_folio(h, folio);
 	spin_unlock_irq(&hugetlb_lock);
 
@@ -2289,6 +2340,13 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 						preferred_nid, nmask);
 		if (folio) {
 			spin_unlock_irq(&hugetlb_lock);
+			/*
+			 * The contents of this page will be completely
+			 * overwritten immediately, as its a migration
+			 * target, so no clearing is needed. Do wait in
+			 * case pre-zero thread was working on it, though.
+			 */
+			hpage_wait_zeroing(h, folio);
 			return folio;
 		}
 	}
@@ -2779,6 +2837,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
 		 */
 		remove_hugetlb_folio(h, old_folio, false);
 
+		prep_new_hugetlb_folio(new_folio);
 		/*
 		 * Ref count on new_folio is already zero as it was dropped
 		 * earlier.  It can be directly added to the pool free list.
@@ -2999,6 +3058,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_unlock_irq(&hugetlb_lock);
 
+	hpage_wait_zeroing(h, folio);
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	if (map_chg != MAP_CHG_ENFORCED) {
@@ -3257,6 +3318,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 		hugetlb_bootmem_init_migratetype(folio, h);
 		/* Subdivide locks to achieve better parallel performance */
 		spin_lock_irqsave(&hugetlb_lock, flags);
+		prep_new_hugetlb_folio(folio);
 		account_new_hugetlb_folio(h, folio);
 		enqueue_hugetlb_folio(h, folio);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
@@ -4190,6 +4252,42 @@ bool __init __attribute((weak)) arch_hugetlb_valid_size(unsigned long size)
 	return size == HPAGE_SIZE;
 }
 
+/*
+ * Zero a hugetlb page.
+ *
+ * The caller has already made sure that the page is not
+ * being actively zeroed out in the background.
+ *
+ * If it wasn't zeroed out, do it ourselves.
+ */
+void hugetlb_zero_folio(struct folio *folio, unsigned long address)
+{
+	if (!folio_test_hugetlb_zeroed(folio))
+		folio_zero_user(folio, address);
+
+	__folio_mark_uptodate(folio);
+}
+
+/*
+ * Once a page has been taken off the freelist, the new page owner
+ * must wait for the pre-zero thread to finish if it happens
+ * to be working on this page (which should be rare).
+ */
+static void hpage_wait_zeroing(struct hstate *h, struct folio *folio)
+{
+	if (!folio_test_hugetlb_zeroing(folio))
+		return;
+
+	spin_lock_irq(&hugetlb_lock);
+
+	wait_event_cmd(h->dqzero_wait[folio_nid(folio)],
+		       !folio_test_hugetlb_zeroing(folio),
+		       spin_unlock_irq(&hugetlb_lock),
+		       spin_lock_irq(&hugetlb_lock));
+
+	spin_unlock_irq(&hugetlb_lock);
+}
+
 void __init hugetlb_add_hstate(unsigned int order)
 {
 	struct hstate *h;
@@ -4205,8 +4303,10 @@ void __init hugetlb_add_hstate(unsigned int order)
 	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
-	for (i = 0; i < MAX_NUMNODES; ++i)
+	for (i = 0; i < MAX_NUMNODES; ++i) {
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+		init_waitqueue_head(&h->dqzero_wait[i]);
+	}
 	INIT_LIST_HEAD(&h->hugepage_activelist);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/SZ_1K);
@@ -5804,8 +5904,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 				ret = 0;
 			goto out;
 		}
-		folio_zero_user(folio, vmf->real_address);
-		__folio_mark_uptodate(folio);
+		hugetlb_zero_folio(folio, vmf->address);
 		new_folio = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/8] mm/hugetlb: convert to prep_account_new_hugetlb_folio()
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
  2025-12-25  8:20 ` [PATCH 1/8] mm/hugetlb: add pre-zeroed framework 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-25  8:20 ` [PATCH 3/8] mm/hugetlb: move the huge folio to the end of the list during enqueue 李喆
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

After a huge folio is instantiated, it is always initialized through
successive calls to prep_new_hugetlb_folio() and
account_new_hugetlb_folio(). To eliminate the risk that future changes
update one routine but overlook the other, consolidate the two functions
into a single entry point, prep_account_new_hugetlb_folio().

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d20614b1c927..63f9369789b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1874,18 +1874,14 @@ void free_huge_folio(struct folio *folio)
 /*
  * Must be called with the hugetlb lock held
  */
-static void account_new_hugetlb_folio(struct hstate *h, struct folio *folio)
-{
-	lockdep_assert_held(&hugetlb_lock);
-	h->nr_huge_pages++;
-	h->nr_huge_pages_node[folio_nid(folio)]++;
-}
-
-static void prep_new_hugetlb_folio(struct folio *folio)
+static void prep_account_new_hugetlb_folio(struct hstate *h,
+					   struct folio *folio)
 {
 	lockdep_assert_held(&hugetlb_lock);
 	folio_clear_hugetlb_freed(folio);
 	prep_clear_zeroed(folio);
+	h->nr_huge_pages++;
+	h->nr_huge_pages_node[folio_nid(folio)]++;
 }
 
 void init_new_hugetlb_folio(struct folio *folio)
@@ -2012,8 +2008,7 @@ void prep_and_add_allocated_folios(struct hstate *h,
 	/* Add all new pool pages to free lists in one lock cycle */
 	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
-		prep_new_hugetlb_folio(folio);
-		account_new_hugetlb_folio(h, folio);
+		prep_account_new_hugetlb_folio(h, folio);
 		enqueue_hugetlb_folio(h, folio);
 	}
 	spin_unlock_irqrestore(&hugetlb_lock, flags);
@@ -2220,13 +2215,12 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 		return NULL;
 
 	spin_lock_irq(&hugetlb_lock);
-	prep_new_hugetlb_folio(folio);
 	/*
 	 * nr_huge_pages needs to be adjusted within the same lock cycle
 	 * as surplus_pages, otherwise it might confuse
 	 * persistent_huge_pages() momentarily.
 	 */
-	account_new_hugetlb_folio(h, folio);
+	prep_account_new_hugetlb_folio(h, folio);
 
 	/*
 	 * We could have raced with the pool size change.
@@ -2264,8 +2258,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
 		return NULL;
 
 	spin_lock_irq(&hugetlb_lock);
-	prep_new_hugetlb_folio(folio);
-	account_new_hugetlb_folio(h, folio);
+	prep_account_new_hugetlb_folio(h, folio);
 	spin_unlock_irq(&hugetlb_lock);
 
 	/* fresh huge pages are frozen */
@@ -2831,18 +2824,17 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
 		/*
 		 * Ok, old_folio is still a genuine free hugepage. Remove it from
 		 * the freelist and decrease the counters. These will be
-		 * incremented again when calling account_new_hugetlb_folio()
+		 * incremented again when calling prep_account_new_hugetlb_folio()
 		 * and enqueue_hugetlb_folio() for new_folio. The counters will
 		 * remain stable since this happens under the lock.
 		 */
 		remove_hugetlb_folio(h, old_folio, false);
 
-		prep_new_hugetlb_folio(new_folio);
 		/*
 		 * Ref count on new_folio is already zero as it was dropped
 		 * earlier.  It can be directly added to the pool free list.
 		 */
-		account_new_hugetlb_folio(h, new_folio);
+		prep_account_new_hugetlb_folio(h, new_folio);
 		enqueue_hugetlb_folio(h, new_folio);
 
 		/*
@@ -3318,8 +3310,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 		hugetlb_bootmem_init_migratetype(folio, h);
 		/* Subdivide locks to achieve better parallel performance */
 		spin_lock_irqsave(&hugetlb_lock, flags);
-		prep_new_hugetlb_folio(folio);
-		account_new_hugetlb_folio(h, folio);
+		prep_account_new_hugetlb_folio(h, folio);
 		enqueue_hugetlb_folio(h, folio);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 3/8] mm/hugetlb: move the huge folio to the end of the list during enqueue
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
  2025-12-25  8:20 ` [PATCH 1/8] mm/hugetlb: add pre-zeroed framework 李喆
  2025-12-25  8:20 ` [PATCH 2/8] mm/hugetlb: convert to prep_account_new_hugetlb_folio() 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-25  8:20 ` [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages" 李喆
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

With this patch, huge folios are inserted at the tail of the per-node
free list when they are enqueued (they are not marked zeroed at that
point). A follow-on patch will move pre-zeroed folios to the head of the
list, so that allocations can obtain a pre-zeroed huge folio with
minimal searching. Keeping newly zeroed pages at the head also means
they are picked first by the next allocation, which helps keep the cache
hot.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 63f9369789b5..8d36487659f8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1312,7 +1312,7 @@ static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 	VM_WARN_ON_FOLIO(folio_test_hugetlb_zeroing(folio), folio);
 
-	list_move(&folio->lru, &h->hugepage_freelists[nid]);
+	list_move_tail(&folio->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
 	prep_clear_zeroed(folio);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (2 preceding siblings ...)
  2025-12-25  8:20 ` [PATCH 3/8] mm/hugetlb: move the huge folio to the end of the list during enqueue 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-26 18:51   ` Frank van der Linden
  2025-12-25  8:20 ` [PATCH 5/8] mm/hugetlb: simplify function hugetlb_sysfs_add_hstate() 李喆
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40 milliseconds
for a 1G page on a recent AMD-based system).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial delay
when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by
these pages) starts. For 256 1G pages and 40ms per page, this would
take 10 seconds, a noticeable delay.

This patch adds a new zeroable_hugepages interface under each
/sys/devices/system/node/node*/hugepages/hugepages-***kB directory.
Reading it returns the number of huge folios of the corresponding size
on that node that are eligible for pre-zeroing. Writing an integer count
(or the string "max") requests that up to that many huge pages be zeroed
on demand.

Exporting this interface offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding the process to specific CPUs, users can confine zeroing
threads to cores that do not run latency-critical tasks, eliminating
interference.

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide.

On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
least 25628 us (figure taken from the test results cited in [1]).

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Co-developed-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb_sysfs.c | 120 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
index 79ece91406bf..8c3e433209c3 100644
--- a/mm/hugetlb_sysfs.c
+++ b/mm/hugetlb_sysfs.c
@@ -352,6 +352,125 @@ struct node_hstate {
 };
 static struct node_hstate node_hstates[MAX_NUMNODES];
 
+static ssize_t zeroable_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h;
+	unsigned long free_huge_pages_zero;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (WARN_ON(nid == NUMA_NO_NODE))
+		return -EPERM;
+
+	free_huge_pages_zero = h->free_huge_pages_node[nid] -
+			       h->free_huge_pages_zero_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages_zero);
+}
+
+static inline bool zero_should_abort(struct hstate *h, int nid)
+{
+	return (h->free_huge_pages_zero_node[nid] ==
+		h->free_huge_pages_node[nid]) ||
+		list_empty(&h->hugepage_freelists[nid]);
+}
+
+static void zero_free_hugepages_nid(struct hstate *h,
+				   int nid, unsigned int nr_zero)
+{
+	struct list_head *freelist = &h->hugepage_freelists[nid];
+	unsigned int nr_zerod = 0;
+	struct folio *folio;
+
+	if (zero_should_abort(h, nid))
+		return;
+
+	spin_lock_irq(&hugetlb_lock);
+
+	while (nr_zerod < nr_zero) {
+
+		if (zero_should_abort(h, nid) || fatal_signal_pending(current))
+			break;
+
+		freelist = freelist->prev;
+		if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
+			break;
+		folio = list_entry(freelist, struct folio, lru);
+
+		if (folio_test_hugetlb_zeroed(folio) ||
+		    folio_test_hugetlb_zeroing(folio))
+			continue;
+
+		folio_set_hugetlb_zeroing(folio);
+
+		/*
+		 * Incrementing this here is a bit of a fib, since
+		 * the page hasn't been cleared yet (it will be done
+		 * immediately after dropping the lock below). But
+		 * it keeps the count consistent with the overall
+		 * free count in case the page gets taken off the
+		 * freelist while we're working on it.
+		 */
+		h->free_huge_pages_zero_node[nid]++;
+		spin_unlock_irq(&hugetlb_lock);
+
+		/*
+		 * HWPoison pages may show up on the freelist.
+		 * Don't try to zero it out, but do set the flag
+		 * and counts, so that we don't consider it again.
+		 */
+		if (!folio_test_hwpoison(folio))
+			folio_zero_user(folio, 0);
+
+		cond_resched();
+
+		spin_lock_irq(&hugetlb_lock);
+		folio_set_hugetlb_zeroed(folio);
+		folio_clear_hugetlb_zeroing(folio);
+
+		/*
+		 * If the page is still on the free list, move
+		 * it to the head.
+		 */
+		if (folio_test_hugetlb_freed(folio))
+			list_move(&folio->lru, &h->hugepage_freelists[nid]);
+
+		/*
+		 * If someone was waiting for the zero to
+		 * finish, wake them up.
+		 */
+		if (waitqueue_active(&h->dqzero_wait[nid]))
+			wake_up(&h->dqzero_wait[nid]);
+		nr_zerod++;
+		freelist = &h->hugepage_freelists[nid];
+	}
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+static ssize_t zeroable_hugepages_store(struct kobject *kobj,
+	       struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	unsigned int nr_zero;
+	struct hstate *h;
+	int err;
+	int nid;
+
+	if (!strcmp(buf, "max") || !strcmp(buf, "max\n")) {
+		nr_zero = UINT_MAX;
+	} else {
+		err = kstrtouint(buf, 10, &nr_zero);
+		if (err)
+			return err;
+	}
+	h = kobj_to_hstate(kobj, &nid);
+
+	zero_free_hugepages_nid(h, nid, nr_zero);
+
+	return len;
+}
+HSTATE_ATTR(zeroable_hugepages);
+
 /*
  * A subset of global hstate attributes for node devices
  */
@@ -359,6 +478,7 @@ static struct attribute *per_node_hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&free_hugepages_attr.attr,
 	&surplus_hugepages_attr.attr,
+	&zeroable_hugepages_attr.attr,
 	NULL,
 };
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 5/8] mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (3 preceding siblings ...)
  2025-12-25  8:20 ` [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages" 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-25  8:20 ` [PATCH 6/8] mm/hugetlb: relocate the per-hstate struct kobject pointer 李喆
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

The third parameter of hugetlb_sysfs_add_hstate() is currently an array
of struct kobject *, yet the function only ever uses a single element.
Narrow the argument to a pointer to that specific element, so callers
pass exactly what the function needs.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb_sysfs.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
index 8c3e433209c3..87dcd3038abc 100644
--- a/mm/hugetlb_sysfs.c
+++ b/mm/hugetlb_sysfs.c
@@ -304,31 +304,30 @@ static const struct attribute_group hstate_demote_attr_group = {
 };
 
 static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
-				    struct kobject **hstate_kobjs,
+				    struct kobject **hstate_kobj,
 				    const struct attribute_group *hstate_attr_group)
 {
 	int retval;
-	int hi = hstate_index(h);
 
-	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
-	if (!hstate_kobjs[hi])
+	*hstate_kobj = kobject_create_and_add(h->name, parent);
+	if (!*hstate_kobj)
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
+	retval = sysfs_create_group(*hstate_kobj, hstate_attr_group);
 	if (retval) {
-		kobject_put(hstate_kobjs[hi]);
-		hstate_kobjs[hi] = NULL;
+		kobject_put(*hstate_kobj);
+		*hstate_kobj = NULL;
 		return retval;
 	}
 
 	if (h->demote_order) {
-		retval = sysfs_create_group(hstate_kobjs[hi],
+		retval = sysfs_create_group(*hstate_kobj,
 					    &hstate_demote_attr_group);
 		if (retval) {
 			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
-			sysfs_remove_group(hstate_kobjs[hi], hstate_attr_group);
-			kobject_put(hstate_kobjs[hi]);
-			hstate_kobjs[hi] = NULL;
+			sysfs_remove_group(*hstate_kobj, hstate_attr_group);
+			kobject_put(*hstate_kobj);
+			*hstate_kobj = NULL;
 			return retval;
 		}
 	}
@@ -562,8 +561,8 @@ void hugetlb_register_node(struct node *node)
 
 	for_each_hstate(h) {
 		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
-						nhs->hstate_kobjs,
-						&per_node_hstate_attr_group);
+				&nhs->hstate_kobjs[hstate_index(h)],
+				&per_node_hstate_attr_group);
 		if (err) {
 			pr_err("HugeTLB: Unable to add hstate %s for node %d\n",
 				h->name, node->dev.id);
@@ -610,7 +609,7 @@ void __init hugetlb_sysfs_init(void)
 
 	for_each_hstate(h) {
 		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
-					 hstate_kobjs, &hstate_attr_group);
+			&hstate_kobjs[hstate_index(h)], &hstate_attr_group);
 		if (err)
 			pr_err("HugeTLB: Unable to add hstate %s\n", h->name);
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 6/8] mm/hugetlb: relocate the per-hstate struct kobject pointer
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (4 preceding siblings ...)
  2025-12-25  8:20 ` [PATCH 5/8] mm/hugetlb: simplify function hugetlb_sysfs_add_hstate() 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-25  8:20 ` [PATCH 7/8] mm/hugetlb: add epoll support for interface "zeroable_hugepages" 李喆
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

Relocate the per-hstate struct kobject pointer from struct node_hstate
into a standalone structure.

This change prepares for a follow-up patch that adds epoll support to
the "zeroable_hugepages" interface. When a huge folio is freed, an event
must be emitted, yet the freeing context may be atomic, so the
notification is delegated to a workqueue. Wrapping the struct kobject
pointer in its own structure lets the workqueue callback retrieve it
easily via container_of().

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb_sysfs.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
index 87dcd3038abc..08ad39d3e022 100644
--- a/mm/hugetlb_sysfs.c
+++ b/mm/hugetlb_sysfs.c
@@ -338,6 +338,10 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 #ifdef CONFIG_NUMA
 static bool hugetlb_sysfs_initialized __ro_after_init;
 
+struct node_hstate_item {
+	struct kobject *hstate_kobj;
+};
+
 /*
  * node_hstate/s - associate per node hstate attributes, via their kobjects,
  * with node devices in node_devices[] using a parallel array.  The array
@@ -347,7 +351,7 @@ static bool hugetlb_sysfs_initialized __ro_after_init;
  */
 struct node_hstate {
 	struct kobject		*hugepages_kobj;
-	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
+	struct node_hstate_item items[HUGE_MAX_HSTATE];
 };
 static struct node_hstate node_hstates[MAX_NUMNODES];
 
@@ -497,7 +501,7 @@ static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
 		struct node_hstate *nhs = &node_hstates[nid];
 		int i;
 		for (i = 0; i < HUGE_MAX_HSTATE; i++)
-			if (nhs->hstate_kobjs[i] == kobj) {
+			if (nhs->items[i].hstate_kobj == kobj) {
 				if (nidp)
 					*nidp = nid;
 				return &hstates[i];
@@ -522,7 +526,7 @@ void hugetlb_unregister_node(struct node *node)
 
 	for_each_hstate(h) {
 		int idx = hstate_index(h);
-		struct kobject *hstate_kobj = nhs->hstate_kobjs[idx];
+		struct kobject *hstate_kobj = nhs->items[idx].hstate_kobj;
 
 		if (!hstate_kobj)
 			continue;
@@ -530,7 +534,7 @@ void hugetlb_unregister_node(struct node *node)
 			sysfs_remove_group(hstate_kobj, &hstate_demote_attr_group);
 		sysfs_remove_group(hstate_kobj, &per_node_hstate_attr_group);
 		kobject_put(hstate_kobj);
-		nhs->hstate_kobjs[idx] = NULL;
+		nhs->items[idx].hstate_kobj = NULL;
 	}
 
 	kobject_put(nhs->hugepages_kobj);
@@ -561,7 +565,7 @@ void hugetlb_register_node(struct node *node)
 
 	for_each_hstate(h) {
 		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
-				&nhs->hstate_kobjs[hstate_index(h)],
+				&nhs->items[hstate_index(h)].hstate_kobj,
 				&per_node_hstate_attr_group);
 		if (err) {
 			pr_err("HugeTLB: Unable to add hstate %s for node %d\n",
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 7/8] mm/hugetlb: add epoll support for interface "zeroable_hugepages"
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (5 preceding siblings ...)
  2025-12-25  8:20 ` [PATCH 6/8] mm/hugetlb: relocate the per-hstate struct kobject pointer 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-25  8:20 ` [PATCH 8/8] mm/hugetlb: limit event generation frequency of function do_zero_free_notify() 李喆
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

Add epoll support for interface "zeroable_hugepages". When no huge folios
are available for pre-zeroing, user space can block on the
zeroable_hugepages file with epoll, and it will be woken as soon as one
or more huge folios become eligible for pre-zeroing.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb.c          | 13 +++++++++++++
 mm/hugetlb_internal.h |  6 ++++++
 mm/hugetlb_sysfs.c    | 22 +++++++++++++++++++++-
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8d36487659f8..c2df0317fe15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1868,6 +1868,7 @@ void free_huge_folio(struct folio *folio)
 		arch_clear_hugetlb_flags(folio);
 		enqueue_hugetlb_folio(h, folio);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
+		do_zero_free_notify(h, folio_nid(folio));
 	}
 }
 
@@ -1999,8 +2000,10 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 void prep_and_add_allocated_folios(struct hstate *h,
 				   struct list_head *folio_list)
 {
+	nodemask_t allocated_mask = NODE_MASK_NONE;
 	unsigned long flags;
 	struct folio *folio, *tmp_f;
+	int nid;
 
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
@@ -2010,8 +2013,12 @@ void prep_and_add_allocated_folios(struct hstate *h,
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		prep_account_new_hugetlb_folio(h, folio);
 		enqueue_hugetlb_folio(h, folio);
+		node_set(folio_nid(folio), allocated_mask);
 	}
 	spin_unlock_irqrestore(&hugetlb_lock, flags);
+
+	for_each_node_mask(nid, allocated_mask)
+		do_zero_free_notify(h, nid);
 }
 
 /*
@@ -2383,6 +2390,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	long needed, allocated;
 	bool alloc_ok = true;
 	nodemask_t *mbind_nodemask, alloc_nodemask;
+	nodemask_t allocated_mask = NODE_MASK_NONE;
+	int nid;
 
 	mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 	if (mbind_nodemask)
@@ -2455,9 +2464,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 			break;
 		/* Add the page to the hugetlb allocator */
 		enqueue_hugetlb_folio(h, folio);
+		node_set(folio_nid(folio), allocated_mask);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);
+	for_each_node_mask(nid, allocated_mask)
+		do_zero_free_notify(h, nid);
 
 	/*
 	 * Free unnecessary surplus pages to the buddy allocator.
@@ -2841,6 +2853,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
 		 * Folio has been replaced, we can safely free the old one.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
+		do_zero_free_notify(h, folio_nid(new_folio));
 		update_and_free_hugetlb_folio(h, old_folio, false);
 	}
 
diff --git a/mm/hugetlb_internal.h b/mm/hugetlb_internal.h
index 1d2f870deccf..9c60661283c7 100644
--- a/mm/hugetlb_internal.h
+++ b/mm/hugetlb_internal.h
@@ -106,6 +106,12 @@ extern ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 					   struct hstate *h, int nid,
 					   unsigned long count, size_t len);
 
+#ifdef CONFIG_NUMA
+extern void do_zero_free_notify(struct hstate *h, int nid);
+#else
+static inline void do_zero_free_notify(struct hstate *h, int nid) {}
+#endif
+
 extern void hugetlb_sysfs_init(void) __init;
 
 #ifdef CONFIG_SYSCTL
diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
index 08ad39d3e022..c063237249f6 100644
--- a/mm/hugetlb_sysfs.c
+++ b/mm/hugetlb_sysfs.c
@@ -340,6 +340,7 @@ static bool hugetlb_sysfs_initialized __ro_after_init;
 
 struct node_hstate_item {
 	struct kobject *hstate_kobj;
+	struct work_struct notify_work;
 };
 
 /*
@@ -355,6 +356,21 @@ struct node_hstate {
 };
 static struct node_hstate node_hstates[MAX_NUMNODES];
 
+static void pre_zero_notify_fun(struct work_struct *work)
+{
+	struct node_hstate_item *item =
+		container_of(work, struct node_hstate_item, notify_work);
+
+	sysfs_notify(item->hstate_kobj, NULL, "zeroable_hugepages");
+}
+
+void do_zero_free_notify(struct hstate *h, int nid)
+{
+	struct node_hstate *nhs = &node_hstates[nid];
+
+	schedule_work(&nhs->items[hstate_index(h)].notify_work);
+}
+
 static ssize_t zeroable_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
@@ -564,8 +580,11 @@ void hugetlb_register_node(struct node *node)
 		return;
 
 	for_each_hstate(h) {
+		int index = hstate_index(h);
+		struct node_hstate_item *item = &nhs->items[index];
+
 		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
-				&nhs->items[hstate_index(h)].hstate_kobj,
+				&item->hstate_kobj,
 				&per_node_hstate_attr_group);
 		if (err) {
 			pr_err("HugeTLB: Unable to add hstate %s for node %d\n",
@@ -573,6 +592,7 @@ void hugetlb_register_node(struct node *node)
 			hugetlb_unregister_node(node);
 			break;
 		}
+		INIT_WORK(&item->notify_work, pre_zero_notify_fun);
 	}
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 8/8] mm/hugetlb: limit event generation frequency of function do_zero_free_notify()
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (6 preceding siblings ...)
  2025-12-25  8:20 ` [PATCH 7/8] mm/hugetlb: add epoll support for interface "zeroable_hugepages" 李喆
@ 2025-12-25  8:20 ` 李喆
  2025-12-26 18:32 ` [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism Frank van der Linden
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: 李喆 @ 2025-12-25  8:20 UTC (permalink / raw)
  To: muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

Throttle notifications so that notify_work is scheduled at most once per
PRE_ZERO_NOTIFY_MIN_INTV (HZ/100). This reduces the number of work items
queued and makes the mechanism far more efficient when large numbers of
huge folios are freed in rapid succession.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/hugetlb_sysfs.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
index c063237249f6..dd47d48fe910 100644
--- a/mm/hugetlb_sysfs.c
+++ b/mm/hugetlb_sysfs.c
@@ -341,6 +341,8 @@ static bool hugetlb_sysfs_initialized __ro_after_init;
 struct node_hstate_item {
 	struct kobject *hstate_kobj;
 	struct work_struct notify_work;
+	unsigned long notified_at;
+	spinlock_t notify_lock;
 };
 
 /*
@@ -364,11 +366,30 @@ static void pre_zero_notify_fun(struct work_struct *work)
 	sysfs_notify(item->hstate_kobj, NULL, "zeroable_hugepages");
 }
 
+static void __do_zero_free_notify(struct node_hstate_item *item)
+{
+	unsigned long last;
+	unsigned long next;
+
+#define PRE_ZERO_NOTIFY_MIN_INTV       DIV_ROUND_UP(HZ, 100)
+	spin_lock(&item->notify_lock);
+	last = item->notified_at;
+	next = last + PRE_ZERO_NOTIFY_MIN_INTV;
+	if (time_in_range(jiffies, last, next)) {
+		spin_unlock(&item->notify_lock);
+		return;
+	}
+	item->notified_at = jiffies;
+	spin_unlock(&item->notify_lock);
+
+	schedule_work(&item->notify_work);
+}
+
 void do_zero_free_notify(struct hstate *h, int nid)
 {
 	struct node_hstate *nhs = &node_hstates[nid];
 
-	schedule_work(&nhs->items[hstate_index(h)].notify_work);
+	__do_zero_free_notify(&nhs->items[hstate_index(h)]);
 }
 
 static ssize_t zeroable_hugepages_show(struct kobject *kobj,
@@ -593,6 +614,8 @@ void hugetlb_register_node(struct node *node)
 			break;
 		}
 		INIT_WORK(&item->notify_work, pre_zero_notify_fun);
+		item->notified_at = jiffies;
+		spin_lock_init(&item->notify_lock);
 	}
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/8] mm/hugetlb: add pre-zeroed framework
  2025-12-25  8:20 ` [PATCH 1/8] mm/hugetlb: add pre-zeroed framework 李喆
@ 2025-12-26  9:24   ` Raghavendra K T
  2025-12-26  9:48     ` Li Zhe
  0 siblings, 1 reply; 22+ messages in thread
From: Raghavendra K T @ 2025-12-26  9:24 UTC (permalink / raw)
  To: 李喆, muchun.song, osalvador, david, akpm, fvdl
  Cc: linux-mm, linux-kernel

On 12/25/2025 1:50 PM, 李喆 wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> This patch establishes a pre-zeroing framework by introducing two new
> hugetlb page flags and extends the code at every point where these flags
> may later be required. The roles of the two flags are as follows.
> 
> (1) HPG_zeroed – indicates that the huge folio has already been
>      zeroed
> (2) HPG_zeroing – marks that the huge folio is currently being zeroed
> 
> No functional change, as nothing sets the flags yet.
> 
> Co-developed-by: Frank van der Linden <fvdl@google.com>
> Signed-off-by: Frank van der Linden <fvdl@google.com>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>   fs/hugetlbfs/inode.c    |   3 +-
>   include/linux/hugetlb.h |  26 +++++++++
>   mm/hugetlb.c            | 113 +++++++++++++++++++++++++++++++++++++---
>   3 files changed, 133 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 3b4c152c5c73..be6b32ab3ca8 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -828,8 +828,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>   			error = PTR_ERR(folio);
>   			goto out;
>   		}
> -		folio_zero_user(folio, addr);
> -		__folio_mark_uptodate(folio);
> +		hugetlb_zero_folio(folio, addr);
>   		error = hugetlb_add_to_page_cache(folio, mapping, index);
>   		if (unlikely(error)) {
>   			restore_reserve_on_error(h, &pseudo_vma, addr, folio);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 019a1c5281e4..2daf4422a17d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -584,6 +584,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>    * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
>    * HPG_raw_hwp_unreliable - Set when the hugetlb page has a hwpoison sub-page
>    *     that is not tracked by raw_hwp_page list.
> + * HPG_zeroed - page was pre-zeroed.
> + *	Synchronization: hugetlb_lock held when set by pre-zero thread.
> + *	Only valid to read outside hugetlb_lock once the page is off
> + *	the freelist, and HPG_zeroing is clear. Always cleared when a
> + *	page is put (back) on the freelist.
> + * HPG_zeroing - page is being zeroed by the pre-zero thread.
> + *	Synchronization: set and cleared by the pre-zero thread with
> + *	hugetlb_lock held. Access by others is read-only. Once the page
> + *	is off the freelist, this can only change from set -> clear,
> + *	which the new page owner must wait for. Always cleared
> + *	when a page is put (back) on the freelist.
>    */
>   enum hugetlb_page_flags {
>   	HPG_restore_reserve = 0,
> @@ -593,6 +604,8 @@ enum hugetlb_page_flags {
>   	HPG_vmemmap_optimized,
>   	HPG_raw_hwp_unreliable,
>   	HPG_cma,
> +	HPG_zeroed,
> +	HPG_zeroing,
>   	__NR_HPAGEFLAGS,
>   };
>   
> @@ -653,6 +666,8 @@ HPAGEFLAG(Freed, freed)
>   HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
>   HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
>   HPAGEFLAG(Cma, cma)
> +HPAGEFLAG(Zeroed, zeroed)
> +HPAGEFLAG(Zeroing, zeroing)
>   
>   #ifdef CONFIG_HUGETLB_PAGE
>   
> @@ -678,6 +693,12 @@ struct hstate {
>   	unsigned int nr_huge_pages_node[MAX_NUMNODES];
>   	unsigned int free_huge_pages_node[MAX_NUMNODES];
>   	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +
> +	unsigned int free_huge_pages_zero_node[MAX_NUMNODES];
> +
> +	/* Queue to wait for a hugetlb folio that is being prezeroed */
> +	wait_queue_head_t dqzero_wait[MAX_NUMNODES];
> +
>   	char name[HSTATE_NAME_LEN];
>   };
>   
> @@ -711,6 +732,7 @@ int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping
>   			pgoff_t idx);
>   void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>   				unsigned long address, struct folio *folio);
> +void hugetlb_zero_folio(struct folio *folio, unsigned long address);
>   
>   /* arch callback */
>   int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
> @@ -1303,6 +1325,10 @@ static inline bool hugetlb_bootmem_allocated(void)
>   {
>   	return false;
>   }
> +
> +static inline void hugetlb_zero_folio(struct folio *folio, unsigned long address)
> +{
> +}
>   #endif	/* CONFIG_HUGETLB_PAGE */
>   
>   static inline spinlock_t *huge_pte_lock(struct hstate *h,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 51273baec9e5..d20614b1c927 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -93,6 +93,8 @@ static int hugetlb_param_index __initdata;
>   static __init int hugetlb_add_param(char *s, int (*setup)(char *val));
>   static __init void hugetlb_parse_params(void);
>   
> +static void hpage_wait_zeroing(struct hstate *h, struct folio *folio);
> +
>   #define hugetlb_early_param(str, func) \
>   static __init int func##args(char *s) \
>   { \
> @@ -1292,21 +1294,33 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
>   	hugetlb_dup_vma_private(vma);
>   }
>   
> +/*
> + * Clear flags for either a fresh page or one that is being
> + * added to the free list.
> + */
> +static inline void prep_clear_zeroed(struct folio *folio)
> +{
> +	folio_clear_hugetlb_zeroed(folio);
> +	folio_clear_hugetlb_zeroing(folio);
> +}
> +
>   static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
>   {
>   	int nid = folio_nid(folio);
>   
>   	lockdep_assert_held(&hugetlb_lock);
>   	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
> +	VM_WARN_ON_FOLIO(folio_test_hugetlb_zeroing(folio), folio);
>   
>   	list_move(&folio->lru, &h->hugepage_freelists[nid]);
>   	h->free_huge_pages++;
>   	h->free_huge_pages_node[nid]++;
> +	prep_clear_zeroed(folio);
>   	folio_set_hugetlb_freed(folio);
>   }
>   
> -static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
> -								int nid)
> +static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h, int nid,
> +		gfp_t gfp_mask)
>   {
>   	struct folio *folio;
>   	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
> @@ -1316,6 +1330,16 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
>   		if (pin && !folio_is_longterm_pinnable(folio))
>   			continue;
>   
> +		/*
> +		 * This shouldn't happen, as hugetlb pages are never allocated
> +		 * with GFP_ATOMIC. But be paranoid and check for it, as
> +		 * a zero_busy page might cause a sleep later in
> +		 * hpage_wait_zeroing().
> +		 */
> +		if (WARN_ON_ONCE(folio_test_hugetlb_zeroing(folio) &&
> +					!gfpflags_allow_blocking(gfp_mask)))
> +			continue;
> +
>   		if (folio_test_hwpoison(folio))
>   			continue;
>   
> @@ -1327,6 +1351,10 @@ static struct folio *dequeue_hugetlb_folio_node_exact(struct hstate *h,
>   		folio_clear_hugetlb_freed(folio);
>   		h->free_huge_pages--;
>   		h->free_huge_pages_node[nid]--;
> +		if (folio_test_hugetlb_zeroed(folio) ||
> +		    folio_test_hugetlb_zeroing(folio))
> +			h->free_huge_pages_zero_node[nid]--;
> +
>   		return folio;
>   	}
>   
> @@ -1363,7 +1391,7 @@ static struct folio *dequeue_hugetlb_folio_nodemask(struct hstate *h, gfp_t gfp_
>   			continue;
>   		node = zone_to_nid(zone);
>   
> -		folio = dequeue_hugetlb_folio_node_exact(h, node);
> +		folio = dequeue_hugetlb_folio_node_exact(h, node, gfp_mask);
>   		if (folio)
>   			return folio;
>   	}
> @@ -1490,7 +1518,16 @@ void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
>   		folio_clear_hugetlb_freed(folio);
>   		h->free_huge_pages--;
>   		h->free_huge_pages_node[nid]--;
> +		folio_clear_hugetlb_freed(folio);
>   	}
> +	/*
> +	 * Adjust the zero page counters now. Note that
> +	 * if a page is currently being zeroed, that
> +	 * will be waited for in update_and_free_page()
> +	 */
> +	if (folio_test_hugetlb_zeroed(folio) ||
> +	    folio_test_hugetlb_zeroing(folio))
> +		h->free_huge_pages_zero_node[nid]--;
>   	if (adjust_surplus) {
>   		h->surplus_huge_pages--;
>   		h->surplus_huge_pages_node[nid]--;
> @@ -1543,6 +1580,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>   {
>   	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
>   
> +	VM_WARN_ON_FOLIO(folio_test_hugetlb_zeroing(folio), folio);
> +
>   	if (hstate_is_gigantic_no_runtime(h))
>   		return;
>   
> @@ -1627,6 +1666,7 @@ static void free_hpage_workfn(struct work_struct *work)
>   		 */
>   		h = size_to_hstate(folio_size(folio));
>   
> +		hpage_wait_zeroing(h, folio);
>   		__update_and_free_hugetlb_folio(h, folio);
>   
>   		cond_resched();
> @@ -1643,7 +1683,8 @@ static inline void flush_free_hpage_work(struct hstate *h)
>   static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
>   				 bool atomic)
>   {
> -	if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) {
> +	if ((!folio_test_hugetlb_zeroing(folio) &&
> +	     !folio_test_hugetlb_vmemmap_optimized(folio)) || !atomic) {
>   		__update_and_free_hugetlb_folio(h, folio);
>   		return;
>   	}
> @@ -1840,6 +1881,13 @@ static void account_new_hugetlb_folio(struct hstate *h, struct folio *folio)
>   	h->nr_huge_pages_node[folio_nid(folio)]++;
>   }
>   
> +static void prep_new_hugetlb_folio(struct folio *folio)
> +{
> +	lockdep_assert_held(&hugetlb_lock);
> +	folio_clear_hugetlb_freed(folio);
> +	prep_clear_zeroed(folio);
> +}
> +
>   void init_new_hugetlb_folio(struct folio *folio)
>   {
>   	__folio_set_hugetlb(folio);
> @@ -1964,6 +2012,7 @@ void prep_and_add_allocated_folios(struct hstate *h,
>   	/* Add all new pool pages to free lists in one lock cycle */
>   	spin_lock_irqsave(&hugetlb_lock, flags);
>   	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> +		prep_new_hugetlb_folio(folio);
>   		account_new_hugetlb_folio(h, folio);
>   		enqueue_hugetlb_folio(h, folio);
>   	}
> @@ -2171,6 +2220,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>   		return NULL;
>   
>   	spin_lock_irq(&hugetlb_lock);
> +	prep_new_hugetlb_folio(folio);
>   	/*
>   	 * nr_huge_pages needs to be adjusted within the same lock cycle
>   	 * as surplus_pages, otherwise it might confuse
> @@ -2214,6 +2264,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
>   		return NULL;
>   
>   	spin_lock_irq(&hugetlb_lock);
> +	prep_new_hugetlb_folio(folio);
>   	account_new_hugetlb_folio(h, folio);
>   	spin_unlock_irq(&hugetlb_lock);
>   
> @@ -2289,6 +2340,13 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>   						preferred_nid, nmask);
>   		if (folio) {
>   			spin_unlock_irq(&hugetlb_lock);
> +			/*
> +			 * The contents of this page will be completely
> +			 * overwritten immediately, as it's a migration
> +			 * target, so no clearing is needed. Do wait in
> +			 * case a pre-zero thread was working on it, though.
> +			 */
> +			hpage_wait_zeroing(h, folio);
>   			return folio;
>   		}
>   	}
> @@ -2779,6 +2837,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>   		 */
>   		remove_hugetlb_folio(h, old_folio, false);
>   
> +		prep_new_hugetlb_folio(new_folio);
>   		/*
>   		 * Ref count on new_folio is already zero as it was dropped
>   		 * earlier.  It can be directly added to the pool free list.
> @@ -2999,6 +3058,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>   
>   	spin_unlock_irq(&hugetlb_lock);
>   
> +	hpage_wait_zeroing(h, folio);
> +
>   	hugetlb_set_folio_subpool(folio, spool);
>   
>   	if (map_chg != MAP_CHG_ENFORCED) {
> @@ -3257,6 +3318,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>   		hugetlb_bootmem_init_migratetype(folio, h);
>   		/* Subdivide locks to achieve better parallel performance */
>   		spin_lock_irqsave(&hugetlb_lock, flags);
> +		prep_new_hugetlb_folio(folio);
>   		account_new_hugetlb_folio(h, folio);
>   		enqueue_hugetlb_folio(h, folio);
>   		spin_unlock_irqrestore(&hugetlb_lock, flags);
> @@ -4190,6 +4252,42 @@ bool __init __attribute((weak)) arch_hugetlb_valid_size(unsigned long size)
>   	return size == HPAGE_SIZE;
>   }
>   
> +/*
> + * Zero a hugetlb page.
> + *
> + * The caller has already made sure that the page is not
> + * being actively zeroed out in the background.
> + *
> + * If it wasn't zeroed out, do it ourselves.
> + */
> +void hugetlb_zero_folio(struct folio *folio, unsigned long address)
> +{
> +	if (!folio_test_hugetlb_zeroed(folio))
> +		folio_zero_user(folio, address);
> +
> +	__folio_mark_uptodate(folio);
> +}
> +
> +/*
> + * Once a page has been taken off the freelist, the new page owner
> + * must wait for the pre-zero thread to finish if it happens
> + * to be working on this page (which should be rare).
> + */
> +static void hpage_wait_zeroing(struct hstate *h, struct folio *folio)
> +{
> +	if (!folio_test_hugetlb_zeroing(folio))
> +		return;
> +
> +	spin_lock_irq(&hugetlb_lock);
> +
> +	wait_event_cmd(h->dqzero_wait[folio_nid(folio)],
> +		       !folio_test_hugetlb_zeroing(folio),
> +		       spin_unlock_irq(&hugetlb_lock),
> +		       spin_lock_irq(&hugetlb_lock));
> +
> +	spin_unlock_irq(&hugetlb_lock);
> +}
> +

nit:
This chunk may be simple enough to use guard() above.

[...]

Regards
- Raghu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/8] mm/hugetlb: add pre-zeroed framework
  2025-12-26  9:24   ` Raghavendra K T
@ 2025-12-26  9:48     ` Li Zhe
  0 siblings, 0 replies; 22+ messages in thread
From: Li Zhe @ 2025-12-26  9:48 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: akpm, david, fvdl, linux-kernel, linux-mm, lizhe.67, muchun.song,
	osalvador

On Fri, 26 Dec 2025 14:54:17 +0530, raghavendra.kt@amd.com wrote:

> > +/*
> > + * Once a page has been taken off the freelist, the new page owner
> > + * must wait for the pre-zero thread to finish if it happens
> > + * to be working on this page (which should be rare).
> > + */
> > +static void hpage_wait_zeroing(struct hstate *h, struct folio *folio)
> > +{
> > +	if (!folio_test_hugetlb_zeroing(folio))
> > +		return;
> > +
> > +	spin_lock_irq(&hugetlb_lock);
> > +
> > +	wait_event_cmd(h->dqzero_wait[folio_nid(folio)],
> > +		       !folio_test_hugetlb_zeroing(folio),
> > +		       spin_unlock_irq(&hugetlb_lock),
> > +		       spin_lock_irq(&hugetlb_lock));
> > +
> > +	spin_unlock_irq(&hugetlb_lock);
> > +}
> > +
> 
> nit:
> This chunk may be simple enough to use guard() above.

Thank you for the reminder. I will address this issue in v2.
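
For reference, what I have in mind for v2 is roughly the following
(untested sketch; it assumes the generic guard(spinlock_irq) helper, so
the lock is dropped automatically on return):

  static void hpage_wait_zeroing(struct hstate *h, struct folio *folio)
  {
  	if (!folio_test_hugetlb_zeroing(folio))
  		return;

  	/* taken here, released automatically when the function returns */
  	guard(spinlock_irq)(&hugetlb_lock);

  	wait_event_cmd(h->dqzero_wait[folio_nid(folio)],
  		       !folio_test_hugetlb_zeroing(folio),
  		       spin_unlock_irq(&hugetlb_lock),
  		       spin_lock_irq(&hugetlb_lock));
  }

If mixing guard() with the manual unlock/lock inside wait_event_cmd()
looks too subtle, I can keep the explicit locking instead.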

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (7 preceding siblings ...)
  2025-12-25  8:20 ` [PATCH 8/8] mm/hugetlb: limit event generation frequency of function do_zero_free_notify() 李喆
@ 2025-12-26 18:32 ` Frank van der Linden
  2025-12-26 21:42   ` Frank van der Linden
  2025-12-27  7:21 ` Mateusz Guzik
  2025-12-28 21:44 ` Andrew Morton
  10 siblings, 1 reply; 22+ messages in thread
From: Frank van der Linden @ 2025-12-26 18:32 UTC (permalink / raw)
  To: 李喆
  Cc: muchun.song, osalvador, david, akpm, linux-mm, linux-kernel

On Thu, Dec 25, 2025 at 12:21 AM 李喆 <lizhe.67@bytedance.com> wrote:
>
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write zeroable_hugepages interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
>
> On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
> least 25628 us (figure inherited from the test results cited herein[1]).
>
> In user space, we can use system calls such as epoll and write to zero
> huge pages as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads that wait for huge pages on node 0 to become eligible for
> zeroing; whenever such pages are available, the threads clear them in
> parallel.
>
>   static void thread_fun(void)
>   {
>         epoll_create();
>         epoll_ctl();
>         while (1) {
>                 val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 if (val > 0)
>                         system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 epoll_wait();
>         }
>   }
>
>   static void start_pre_zero_thread(int thread_num)
>   {
>         create_pre_zero_threads(thread_num, thread_fun)
>   }
>
>   int main(void)
>   {
>         start_pre_zero_thread(8);
>   }
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Thanks for taking my patches and extending them!

As far as I can see, you took what I did and then added a framework
for the zeroing to be done in user context, and possibly by multiple
threads, right? There were one or two comments on my original patch
set that objected to the zeroing cost being taken by a kernel thread, not
a user thread, so this should address that.

I'll go through them to provide comments inline.

- Frank


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  2025-12-25  8:20 ` [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages" 李喆
@ 2025-12-26 18:51   ` Frank van der Linden
  2025-12-29 12:25     ` Li Zhe
  0 siblings, 1 reply; 22+ messages in thread
From: Frank van der Linden @ 2025-12-26 18:51 UTC (permalink / raw)
  To: 李喆
  Cc: muchun.song, osalvador, david, akpm, linux-mm, linux-kernel

On Thu, Dec 25, 2025 at 12:22 AM 李喆 <lizhe.67@bytedance.com> wrote:
>
> From: Li Zhe <lizhe.67@bytedance.com>
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40 milliseconds
> for a 1G page on a recent AMD-based system).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial delay
> when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by
> these pages) starts. For 256 1G pages and 40ms per page, this would
> take 10 seconds, a noticeable delay.
>
> This patch adds a new zeroable_hugepages interface under each
> /sys/devices/system/node/node*/hugepages/hugepages-***kB directory.
> Reading it returns the number of huge folios of the corresponding size
> on that node that are eligible for pre-zeroing. The interface also
> accepts an integer x in the range [0, max], enabling user space to
> request that x huge pages be zeroed on demand.
>
> Exporting this interface offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
>
> On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
> least 25628 us (figure inherited from the test results cited herein[1]).
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
>
> Co-developed-by: Frank van der Linden <fvdl@google.com>
> Signed-off-by: Frank van der Linden <fvdl@google.com>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>  mm/hugetlb_sysfs.c | 120 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 120 insertions(+)
>
> diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
> index 79ece91406bf..8c3e433209c3 100644
> --- a/mm/hugetlb_sysfs.c
> +++ b/mm/hugetlb_sysfs.c
> @@ -352,6 +352,125 @@ struct node_hstate {
>  };
>  static struct node_hstate node_hstates[MAX_NUMNODES];
>
> +static ssize_t zeroable_hugepages_show(struct kobject *kobj,
> +                                       struct kobj_attribute *attr, char *buf)
> +{
> +       struct hstate *h;
> +       unsigned long free_huge_pages_zero;
> +       int nid;
> +
> +       h = kobj_to_hstate(kobj, &nid);
> +       if (WARN_ON(nid == NUMA_NO_NODE))
> +               return -EPERM;
> +
> +       free_huge_pages_zero = h->free_huge_pages_node[nid] -
> +                              h->free_huge_pages_zero_node[nid];
> +
> +       return sprintf(buf, "%lu\n", free_huge_pages_zero);
> +}
> +
> +static inline bool zero_should_abort(struct hstate *h, int nid)
> +{
> +       return (h->free_huge_pages_zero_node[nid] ==
> +               h->free_huge_pages_node[nid]) ||
> +               list_empty(&h->hugepage_freelists[nid]);
> +}
> +
> +static void zero_free_hugepages_nid(struct hstate *h,
> +                                  int nid, unsigned int nr_zero)
> +{
> +       struct list_head *freelist = &h->hugepage_freelists[nid];
> +       unsigned int nr_zerod = 0;
> +       struct folio *folio;
> +
> +       if (zero_should_abort(h, nid))
> +               return;
> +
> +       spin_lock_irq(&hugetlb_lock);
> +
> +       while (nr_zerod < nr_zero) {
> +
> +               if (zero_should_abort(h, nid) || fatal_signal_pending(current))
> +                       break;
> +
> +               freelist = freelist->prev;
> +               if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
> +                       break;
> +               folio = list_entry(freelist, struct folio, lru);
> +
> +               if (folio_test_hugetlb_zeroed(folio) ||
> +                   folio_test_hugetlb_zeroing(folio))
> +                       continue;
> +
> +               folio_set_hugetlb_zeroing(folio);
> +
> +               /*
> +                * Incrementing this here is a bit of a fib, since
> +                * the page hasn't been cleared yet (it will be done
> +                * immediately after dropping the lock below). But
> +                * it keeps the count consistent with the overall
> +                * free count in case the page gets taken off the
> +                * freelist while we're working on it.
> +                */
> +               h->free_huge_pages_zero_node[nid]++;
> +               spin_unlock_irq(&hugetlb_lock);
> +
> +               /*
> +                * HWPoison pages may show up on the freelist.
> +                * Don't try to zero it out, but do set the flag
> +                * and counts, so that we don't consider it again.
> +                */
> +               if (!folio_test_hwpoison(folio))
> +                       folio_zero_user(folio, 0);
> +
> +               cond_resched();
> +
> +               spin_lock_irq(&hugetlb_lock);
> +               folio_set_hugetlb_zeroed(folio);
> +               folio_clear_hugetlb_zeroing(folio);
> +
> +               /*
> +                * If the page is still on the free list, move
> +                * it to the head.
> +                */
> +               if (folio_test_hugetlb_freed(folio))
> +                       list_move(&folio->lru, &h->hugepage_freelists[nid]);
> +
> +               /*
> +                * If someone was waiting for the zero to
> +                * finish, wake them up.
> +                */
> +               if (waitqueue_active(&h->dqzero_wait[nid]))
> +                       wake_up(&h->dqzero_wait[nid]);
> +               nr_zerod++;
> +               freelist = &h->hugepage_freelists[nid];
> +       }
> +       spin_unlock_irq(&hugetlb_lock);
> +}

Nit: s/nr_zerod/nr_zeroed/

Feels like the list logic can be cleaned up a bit here. Since the
zeroed folios are at the head of the list, and the dirty ones at the
tail, and you start walking from the tail, you don't need to check if
you circled back to the head - just stop if you encounter a prezeroed
folio. If you encounter a prezeroed folio while walking from the tail,
that means that all other folios from that one to the head will also
be prezeroed already.

- Frank


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-26 18:32 ` [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism Frank van der Linden
@ 2025-12-26 21:42   ` Frank van der Linden
  2025-12-29 12:28     ` Li Zhe
  0 siblings, 1 reply; 22+ messages in thread
From: Frank van der Linden @ 2025-12-26 21:42 UTC (permalink / raw)
  To: 李喆
  Cc: muchun.song, osalvador, david, akpm, linux-mm, linux-kernel

Is there any situation where you would write anything other than 'max'
to the new sysfs file? E.g. in which scenarios does it make sense to
*not* pre-zero all freed hugetlb folios? There doesn't seem to be a
point to just doing a certain number. You can't know for sure if the
number you read will remain correct, as it's just a snapshot. So how
would you determine a correct number other than 'max'?

- Frank


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (8 preceding siblings ...)
  2025-12-26 18:32 ` [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism Frank van der Linden
@ 2025-12-27  7:21 ` Mateusz Guzik
  2025-12-29 12:31   ` Li Zhe
  2025-12-28 21:44 ` Andrew Morton
  10 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2025-12-27  7:21 UTC (permalink / raw)
  To: 李喆
  Cc: muchun.song, osalvador, david, akpm, fvdl, linux-mm, linux-kernel

On Thu, Dec 25, 2025 at 04:20:51PM +0800, 李喆 wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
> 
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.
> 
> To accelerate the above scenario, this patchset exports a per-node,
> read-write zeroable_hugepages interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
> 
> This mechanism offers the following advantages:
> 
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
> 
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
> 
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
> 
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
> 
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
> 
> On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
> least 25628 us (figure inherited from the test results cited herein[1]).
> 
> In user space, we can use system calls such as epoll and write to zero
> huge pages as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads that wait for huge pages on node 0 to become eligible for
> zeroing; whenever such pages are available, the threads clear them in
> parallel.
> 
>   static void thread_fun(void)
>   {
>   	epoll_create();
>   	epoll_ctl();
>   	while (1) {
>   		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		if (val > 0)
>   			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		epoll_wait();
>   	}
>   }
>   
>   static void start_pre_zero_thread(int thread_num)
>   {
>   	create_pre_zero_threads(thread_num, thread_fun)
>   }
>   
>   int main(void)
>   {
>   	start_pre_zero_thread(8);
>   }
> 

In the name of "provide tools, not policy" making userspace call the
shots is the right approach, which I advocated for in the original
thread.

I do have concerns about the specific interface as I think it is a
little too limited.

Suppose vastly different deployments with different needs. For example
one may want to keep at least n pages ready to use, RAM permitting.

At the same time it perhaps would like to balance CPU usage vs other
tasks, so for example it would control parallelism based on observed
churn rate.

So a toolset I would consider viable would need to provide an extensible
interface to future-proof it.

As for an immediate need not met with the current patchset, there is no
configurable threshold for free zeroed page count to generate a wake up.

I suspect a bunch of ioctls would be needed here.

I don't know if sysfs is viable at all for this. Worst case a device (or
a set of per-node devices) can be created with the same goal.

For illustrative purposes perhaps something like this:

I'm assuming a centralized file/device; the node parameter can be dropped
otherwise.

struct hugepage_zero_req {
	int version; /* version of the struct for extensibility purposes; alternatively different versions can use different ioctls */
	int node; /* numa node to zero in */
	int pages; /* max pages to zero out in this call */
};

then interested threads can do:
	struct hugepage_zero_req hzr = { .node = 0, .pages = UINT_MAX };
	pages = ioctl(hfd, HUGEPAGE_ZERO_PERFORM, &hzr); /* returns the number of pages zeroed */

struct hugepage_zero_configure {
	int version;
	int node; /* numa node to watch, open the device more times for other nodes */
	int minfree; /* issue a wakeup if the free zeroed page count drops below this value */
};

and of course:

struct hugepage_zero_query {
	int version;
	int node;
	size_t total_pages; /* all pages installed in the domain */
	size_t free_pages; /* total free pages */
	size_t free_huge_pages; /* total free huge pages */
	size_t zeroed_huge_pages; /* huge pages ready to use */
	size_t pad[MEDIUM_NUM]; /* optionally make extensible without abi breakage, version handling and ioctl renumbering? */
	.... /* whatever else useful, note it can be added later */
};

Then I would imagine a userspace daemon with arbitrary policies can be
written just fine.

Consider this pseudo-code which spawns 8 threads on domain 1 and
dispatches some number of them to zero stuff based on what it sees.

#define THREADS 8
int main(void)
{
	bind_to_domain(1);
	start_pre_zero_threads(THREADS); /* create a thread pool */
	epoll_create();
	struct hugepage_zero_configure hzc = { .version = MAGIC_VERSION, .node = 1, .minfree = SMALL_NUM };
	hfd = open("/dev/hugepagectl", O_RDWR);
	ioctl(hfd, HUGEPAGE_ZERO_CONFIGURE, &hzc);
	epoll_ctl();

	for (;;) {
		epoll_wait();

		struct hugepage_zero_query hzq = { .version = MAGIC_VERSION, .node = 1 };
		ioctl(hfd, HUGEPAGE_ZERO_QUERY, &hzq);
		if (hzq.free_huge_pages == 0)
			/* nothing which can be done */
			continue;
		tozero = tune_the_request(&hzq);
		/*
		 * get up to THREADS workers zeroing in parallel based on magic policy
		 */
		dispatch(tozero); 
	}
}

If one wants one daemon handling multiple domains, it can open the
file once per domain to cover them.
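
E.g., still purely illustrative (MAX_NODES, nr_nodes and configure_node()
are just placeholders wrapping the HUGEPAGE_ZERO_CONFIGURE ioctl above):

	int fds[MAX_NODES];

	for (int node = 0; node < nr_nodes; node++) {
		fds[node] = open("/dev/hugepagectl", O_RDWR);
		/* register the per-node wakeup threshold on this fd */
		configure_node(fds[node], node);
	}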


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
                   ` (9 preceding siblings ...)
  2025-12-27  7:21 ` Mateusz Guzik
@ 2025-12-28 21:44 ` Andrew Morton
  2025-12-29 12:34   ` Li Zhe
  10 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2025-12-28 21:44 UTC (permalink / raw)
  To: 李喆
  Cc: muchun.song, osalvador, david, fvdl, linux-mm, linux-kernel, Ankur Arora

On Thu, 25 Dec 2025 16:20:51 +0800 李喆 <lizhe.67@bytedance.com> wrote:

> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
> 
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.

Ankur's contiguous page clearing work
(https://lkml.kernel.org/r/20251215204922.475324-1-ankur.a.arora@oracle.com)
will hopefully result in significant changes to the timing observations
in your changelogs?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  2025-12-26 18:51   ` Frank van der Linden
@ 2025-12-29 12:25     ` Li Zhe
  2025-12-29 18:57       ` Frank van der Linden
  0 siblings, 1 reply; 22+ messages in thread
From: Li Zhe @ 2025-12-29 12:25 UTC (permalink / raw)
  To: fvdl
  Cc: akpm, david, linux-kernel, linux-mm, lizhe.67, muchun.song, osalvador

On Fri, 26 Dec 2025 10:51:01 -0800, fvdl@google.com wrote:

> > +static ssize_t zeroable_hugepages_show(struct kobject *kobj,
> > +                                       struct kobj_attribute *attr, char *buf)
> > +{
> > +       struct hstate *h;
> > +       unsigned long free_huge_pages_zero;
> > +       int nid;
> > +
> > +       h = kobj_to_hstate(kobj, &nid);
> > +       if (WARN_ON(nid == NUMA_NO_NODE))
> > +               return -EPERM;
> > +
> > +       free_huge_pages_zero = h->free_huge_pages_node[nid] -
> > +                              h->free_huge_pages_zero_node[nid];
> > +
> > +       return sprintf(buf, "%lu\n", free_huge_pages_zero);
> > +}
> > +
> > +static inline bool zero_should_abort(struct hstate *h, int nid)
> > +{
> > +       return (h->free_huge_pages_zero_node[nid] ==
> > +               h->free_huge_pages_node[nid]) ||
> > +               list_empty(&h->hugepage_freelists[nid]);
> > +}
> > +
> > +static void zero_free_hugepages_nid(struct hstate *h,
> > +                                  int nid, unsigned int nr_zero)
> > +{
> > +       struct list_head *freelist = &h->hugepage_freelists[nid];
> > +       unsigned int nr_zerod = 0;
> > +       struct folio *folio;
> > +
> > +       if (zero_should_abort(h, nid))
> > +               return;
> > +
> > +       spin_lock_irq(&hugetlb_lock);
> > +
> > +       while (nr_zerod < nr_zero) {
> > +
> > +               if (zero_should_abort(h, nid) || fatal_signal_pending(current))
> > +                       break;
> > +
> > +               freelist = freelist->prev;
> > +               if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
> > +                       break;
> > +               folio = list_entry(freelist, struct folio, lru);
> > +
> > +               if (folio_test_hugetlb_zeroed(folio) ||
> > +                   folio_test_hugetlb_zeroing(folio))
> > +                       continue;
> > +
> > +               folio_set_hugetlb_zeroing(folio);
> > +
> > +               /*
> > +                * Incrementing this here is a bit of a fib, since
> > +                * the page hasn't been cleared yet (it will be done
> > +                * immediately after dropping the lock below). But
> > +                * it keeps the count consistent with the overall
> > +                * free count in case the page gets taken off the
> > +                * freelist while we're working on it.
> > +                */
> > +               h->free_huge_pages_zero_node[nid]++;
> > +               spin_unlock_irq(&hugetlb_lock);
> > +
> > +               /*
> > +                * HWPoison pages may show up on the freelist.
> > +                * Don't try to zero it out, but do set the flag
> > +                * and counts, so that we don't consider it again.
> > +                */
> > +               if (!folio_test_hwpoison(folio))
> > +                       folio_zero_user(folio, 0);
> > +
> > +               cond_resched();
> > +
> > +               spin_lock_irq(&hugetlb_lock);
> > +               folio_set_hugetlb_zeroed(folio);
> > +               folio_clear_hugetlb_zeroing(folio);
> > +
> > +               /*
> > +                * If the page is still on the free list, move
> > +                * it to the head.
> > +                */
> > +               if (folio_test_hugetlb_freed(folio))
> > +                       list_move(&folio->lru, &h->hugepage_freelists[nid]);
> > +
> > +               /*
> > +                * If someone was waiting for the zero to
> > +                * finish, wake them up.
> > +                */
> > +               if (waitqueue_active(&h->dqzero_wait[nid]))
> > +                       wake_up(&h->dqzero_wait[nid]);
> > +               nr_zerod++;
> > +               freelist = &h->hugepage_freelists[nid];
> > +       }
> > +       spin_unlock_irq(&hugetlb_lock);
> > +}
> 
> Nit: s/nr_zerod/nr_zeroed/

Thank you for the reminder. I will address this issue in v2.

> Feels like the list logic can be cleaned up a bit here. Since the
> zeroed folios are at the head of the list, and the dirty ones at the
> tail, and you start walking from the tail, you don't need to check if
> you circled back to the head - just stop if you encounter a prezeroed
> folio. If you encounter a prezeroed folio while walking from the tail,
> that means that all other folios from that one to the head will also
> be prezeroed already.

Thank you for the thoughtful suggestion. Your line of reasoning is,
in most situations, perfectly valid. Under extreme concurrency,
however, a corner case can still appear. Imagine two processes
simultaneously zeroing huge pages: Process A enters
zero_free_hugepages_nid(), completes the zeroing of one huge page,
and marks the folio in the list as pre-zeroed. Should Process B enter
the same function moments later and decide to exit as soon as it
meets a prezeroed folio, the intended parallel zeroing would quietly
fall back to a single-threaded pace.

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-26 21:42   ` Frank van der Linden
@ 2025-12-29 12:28     ` Li Zhe
  0 siblings, 0 replies; 22+ messages in thread
From: Li Zhe @ 2025-12-29 12:28 UTC (permalink / raw)
  To: fvdl
  Cc: akpm, david, linux-kernel, linux-mm, lizhe.67, muchun.song, osalvador

On Fri, 26 Dec 2025 13:42:13 -0800, fvdl@google.com wrote:

> Is there any situation where you would write anything other than 'max'
> to the new sysfs file? E.g. in which scenarios does it make sense to
> *not* pre-zero all freed hugetlb folios? There doesn't seem to be a
> point to just doing a certain number. You can't know for sure if the
> number you read will remain correct, as it's just a snapshot. So how
> would you determine a correct number other than 'max'?

My view is that each application knows its own huge-page requirement
and should therefore write the corresponding number into the
"zeroable_hugepages" interface. Since the zeroing work is accounted
to the application process, the CPU time it consumes can be
constrained through that process's cgroup.
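
For example, an application that knows it will need 16 1G pages on
node 0 could do something like the following before starting up (a
minimal sketch; the "16" is of course just an illustrative number):

  #include <fcntl.h>
  #include <unistd.h>

  static void prezero_my_pages(void)
  {
  	const char *path = "/sys/devices/system/node/node0/hugepages/"
  			   "hugepages-1048576kB/zeroable_hugepages";
  	int fd = open(path, O_WRONLY);

  	if (fd < 0)
  		return;
  	/* ask the kernel to zero up to 16 huge pages, instead of "max" */
  	write(fd, "16", 2);
  	close(fd);
  }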

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-27  7:21 ` Mateusz Guzik
@ 2025-12-29 12:31   ` Li Zhe
  0 siblings, 0 replies; 22+ messages in thread
From: Li Zhe @ 2025-12-29 12:31 UTC (permalink / raw)
  To: mjguzik
  Cc: akpm, david, fvdl, linux-kernel, linux-mm, lizhe.67, muchun.song,
	osalvador

On Sat, 27 Dec 2025 08:21:16 +0100, mjguzik@gmail.com wrote:

> In the name of "provide tools, not policy" making userspace call the
> shots is the right approach, which I advocated for in the original
> thread.

Thank you for your endorsement!

> I do have concerns about the specific interface as I think it is a
> little too limited.
> 
> Suppose vastly different deployments with different needs. For example
> one may want to keep at least n pages ready to use, RAM permitting.
> 
> At the same time it perhaps would like to balance CPU usage vs other
> tasks, so for example it would control parallelism based on observed
> churn rate.
> 
> So a toolset I would consider viable would need to provide an extensible
> interface to future-proof it.
> 
> As for an immediate need not met with the current patchset, there is no
> configurable threshold for free zeroed page count to generate a wake up.
> 
> I suspect a bunch of ioctls would be needed here.
> 
> I don't know if sysfs is viable at all for this. Worst case a device (or
> a set of per-node devices) can be created with the same goal.

In my view, the present kernel framework does not allow an ioctl
interface to be placed under the per-node huge-page directories.

The functionality you describe appears to align closely with that
offered by the cgroup.event_control interface in the memory
controller.

We could therefore introduce a new event_control file for huge-page
events, following the same pattern. Given that all huge-page
attributes already live in sysfs, such an addition would keep the
interface consistent and avoid the extra indirection of a new
/dev/hugepagectl file.

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
  2025-12-28 21:44 ` Andrew Morton
@ 2025-12-29 12:34   ` Li Zhe
  0 siblings, 0 replies; 22+ messages in thread
From: Li Zhe @ 2025-12-29 12:34 UTC (permalink / raw)
  To: akpm
  Cc: ankur.a.arora, david, fvdl, linux-kernel, linux-mm, lizhe.67,
	muchun.song, osalvador

On Sun, 28 Dec 2025 13:44:54 -0800, akpm@linux-foundation.org wrote:

> > Fresh hugetlb pages are zeroed out when they are faulted in,
> > just like with all other page types. This can take up a good
> > amount of time for larger page sizes (e.g. around 40
> > milliseconds for a 1G page on a recent AMD-based system).
> > 
> > This normally isn't a problem, since hugetlb pages are typically
> > mapped by the application for a long time, and the initial
> > delay when touching them isn't much of an issue.
> > 
> > However, there are some use cases where a large number of hugetlb
> > pages are touched when an application (such as a VM backed by these
> > pages) starts. For 256 1G pages and 40ms per page, this would take
> > 10 seconds, a noticeable delay.
> 
> Ankur's contiguous page clearing work
> (https://lkml.kernel.org/r/20251215204922.475324-1-ankur.a.arora@oracle.com)
> will hopefully result in significant changes to the timing observations
> in your changelogs?

I ran the experiment on my Skylake machine; below are the latencies for
faulting in 64 GiB of memory backed by 1-GiB huge pages.

Without Ankur's optimization:

	Total time: 15.989429 seconds
	Avg time per 1GB page: 0.249835 seconds

With Ankur's optimization:

	Total time: 12.931696 seconds
	Avg time per 1GB page: 0.202058 seconds

For comparison, when the same 64 GiB of memory had already been zeroed
by the pre-zeroing mechanism, the test completed in negligible time.
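
For clarity, the measurement is essentially the loop below (a simplified
sketch of the kind of test program I used; the size and mmap flags are
illustrative and assume the 1G pool on node 0 is already reserved):

  #include <stdio.h>
  #include <time.h>
  #include <sys/mman.h>

  #ifndef MAP_HUGE_1GB
  #define MAP_HUGE_1GB	(30U << 26)	/* from <linux/mman.h> */
  #endif

  #define SZ		(64UL << 30)	/* 64 GiB in total */
  #define PAGE_1G	(1UL << 30)

  int main(void)
  {
  	struct timespec t0, t1;
  	char *p;

  	p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
  		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
  		 -1, 0);
  	if (p == MAP_FAILED)
  		return 1;

  	clock_gettime(CLOCK_MONOTONIC, &t0);
  	for (unsigned long off = 0; off < SZ; off += PAGE_1G)
  		p[off] = 1;	/* fault in one 1G page per iteration */
  	clock_gettime(CLOCK_MONOTONIC, &t1);

  	printf("Total time: %f seconds\n",
  	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
  	return 0;
  }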

I will incorporate these findings into the V2 description.

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  2025-12-29 12:25     ` Li Zhe
@ 2025-12-29 18:57       ` Frank van der Linden
  2025-12-30  2:41         ` Li Zhe
  0 siblings, 1 reply; 22+ messages in thread
From: Frank van der Linden @ 2025-12-29 18:57 UTC (permalink / raw)
  To: Li Zhe; +Cc: akpm, david, linux-kernel, linux-mm, muchun.song, osalvador

On Mon, Dec 29, 2025 at 4:26 AM Li Zhe <lizhe.67@bytedance.com> wrote:
>
> On Fri, 26 Dec 2025 10:51:01 -0800, fvdl@google.com wrote:
>
> > > +static ssize_t zeroable_hugepages_show(struct kobject *kobj,
> > > +                                       struct kobj_attribute *attr, char *buf)
> > > +{
> > > +       struct hstate *h;
> > > +       unsigned long free_huge_pages_zero;
> > > +       int nid;
> > > +
> > > +       h = kobj_to_hstate(kobj, &nid);
> > > +       if (WARN_ON(nid == NUMA_NO_NODE))
> > > +               return -EPERM;
> > > +
> > > +       free_huge_pages_zero = h->free_huge_pages_node[nid] -
> > > +                              h->free_huge_pages_zero_node[nid];
> > > +
> > > +       return sprintf(buf, "%lu\n", free_huge_pages_zero);
> > > +}
> > > +
> > > +static inline bool zero_should_abort(struct hstate *h, int nid)
> > > +{
> > > +       return (h->free_huge_pages_zero_node[nid] ==
> > > +               h->free_huge_pages_node[nid]) ||
> > > +               list_empty(&h->hugepage_freelists[nid]);
> > > +}
> > > +
> > > +static void zero_free_hugepages_nid(struct hstate *h,
> > > +                                  int nid, unsigned int nr_zero)
> > > +{
> > > +       struct list_head *freelist = &h->hugepage_freelists[nid];
> > > +       unsigned int nr_zerod = 0;
> > > +       struct folio *folio;
> > > +
> > > +       if (zero_should_abort(h, nid))
> > > +               return;
> > > +
> > > +       spin_lock_irq(&hugetlb_lock);
> > > +
> > > +       while (nr_zerod < nr_zero) {
> > > +
> > > +               if (zero_should_abort(h, nid) || fatal_signal_pending(current))
> > > +                       break;
> > > +
> > > +               freelist = freelist->prev;
> > > +               if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
> > > +                       break;
> > > +               folio = list_entry(freelist, struct folio, lru);
> > > +
> > > +               if (folio_test_hugetlb_zeroed(folio) ||
> > > +                   folio_test_hugetlb_zeroing(folio))
> > > +                       continue;
> > > +
> > > +               folio_set_hugetlb_zeroing(folio);
> > > +
> > > +               /*
> > > +                * Incrementing this here is a bit of a fib, since
> > > +                * the page hasn't been cleared yet (it will be done
> > > +                * immediately after dropping the lock below). But
> > > +                * it keeps the count consistent with the overall
> > > +                * free count in case the page gets taken off the
> > > +                * freelist while we're working on it.
> > > +                */
> > > +               h->free_huge_pages_zero_node[nid]++;
> > > +               spin_unlock_irq(&hugetlb_lock);
> > > +
> > > +               /*
> > > +                * HWPoison pages may show up on the freelist.
> > > +                * Don't try to zero it out, but do set the flag
> > > +                * and counts, so that we don't consider it again.
> > > +                */
> > > +               if (!folio_test_hwpoison(folio))
> > > +                       folio_zero_user(folio, 0);
> > > +
> > > +               cond_resched();
> > > +
> > > +               spin_lock_irq(&hugetlb_lock);
> > > +               folio_set_hugetlb_zeroed(folio);
> > > +               folio_clear_hugetlb_zeroing(folio);
> > > +
> > > +               /*
> > > +                * If the page is still on the free list, move
> > > +                * it to the head.
> > > +                */
> > > +               if (folio_test_hugetlb_freed(folio))
> > > +                       list_move(&folio->lru, &h->hugepage_freelists[nid]);
> > > +
> > > +               /*
> > > +                * If someone was waiting for the zero to
> > > +                * finish, wake them up.
> > > +                */
> > > +               if (waitqueue_active(&h->dqzero_wait[nid]))
> > > +                       wake_up(&h->dqzero_wait[nid]);
> > > +               nr_zerod++;
> > > +               freelist = &h->hugepage_freelists[nid];
> > > +       }
> > > +       spin_unlock_irq(&hugetlb_lock);
> > > +}
> >
> > Nit: s/nr_zerod/nr_zeroed/
>
> Thank you for the reminder. I will address this issue in v2.
>
> > Feels like the list logic can be cleaned up a bit here. Since the
> > zeroed folios are at the head of the list, and the dirty ones at the
> > tail, and you start walking from the tail, you don't need to check if
> > you circled back to the head - just stop if you encounter a prezeroed
> > folio. If you encounter a prezeroed folio while walking from the tail,
> > that means that all other folios from that one to the head will also
> > be prezeroed already.
>
> Thank you for the thoughtful suggestion. Your line of reasoning is,
> in most situations, perfectly valid. Under extreme concurrency,
> however, a corner case can still appear. Imagine two processes
> simultaneously zeroing huge pages: Process A enters
> zero_free_hugepages_nid(), completes the zeroing of one huge page,
> and marks the folio in the list as pre-zeroed. Should Process B enter
> the same function moments later and decide to exit as soon as it
> meets a prezeroed folio, the intended parallel zeroing would quietly
> fall back to a single-threaded pace.

Hm, setting the prezeroed bit and moving the folio to the front of the
free list happens while holding hugetlb_lock. In other words, if you
encounter a folio with the prezeroed bit set while holding
hugetlb_lock, it will always be in a contiguous stretch of prezeroed
folios at the head of the free list.

Since the check for 'is this already prezeroed' is done while holding
hugetlb_lock, you know for sure that the folio is part of a list of
prezeroed folios at the head, and you can stop, right?

- Frank


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  2025-12-29 18:57       ` Frank van der Linden
@ 2025-12-30  2:41         ` Li Zhe
  0 siblings, 0 replies; 22+ messages in thread
From: Li Zhe @ 2025-12-30  2:41 UTC (permalink / raw)
  To: fvdl
  Cc: akpm, david, linux-kernel, linux-mm, lizhe.67, muchun.song, osalvador

> On Mon, 29 Dec 2025 10:57:23 -0800, fvdl@google.com wrote:
> 
> On Mon, Dec 29, 2025 at 4:26 AM Li Zhe <lizhe.67@bytedance.com> wrote:
> >
> > On Fri, 26 Dec 2025 10:51:01 -0800, fvdl@google.com wrote:
> >
> > > > +static ssize_t zeroable_hugepages_show(struct kobject *kobj,
> > > > +                                       struct kobj_attribute *attr, char *buf)
> > > > +{
> > > > +       struct hstate *h;
> > > > +       unsigned long free_huge_pages_zero;
> > > > +       int nid;
> > > > +
> > > > +       h = kobj_to_hstate(kobj, &nid);
> > > > +       if (WARN_ON(nid == NUMA_NO_NODE))
> > > > +               return -EPERM;
> > > > +
> > > > +       free_huge_pages_zero = h->free_huge_pages_node[nid] -
> > > > +                              h->free_huge_pages_zero_node[nid];
> > > > +
> > > > +       return sprintf(buf, "%lu\n", free_huge_pages_zero);
> > > > +}
> > > > +
> > > > +static inline bool zero_should_abort(struct hstate *h, int nid)
> > > > +{
> > > > +       return (h->free_huge_pages_zero_node[nid] ==
> > > > +               h->free_huge_pages_node[nid]) ||
> > > > +               list_empty(&h->hugepage_freelists[nid]);
> > > > +}
> > > > +
> > > > +static void zero_free_hugepages_nid(struct hstate *h,
> > > > +                                  int nid, unsigned int nr_zero)
> > > > +{
> > > > +       struct list_head *freelist = &h->hugepage_freelists[nid];
> > > > +       unsigned int nr_zerod = 0;
> > > > +       struct folio *folio;
> > > > +
> > > > +       if (zero_should_abort(h, nid))
> > > > +               return;
> > > > +
> > > > +       spin_lock_irq(&hugetlb_lock);
> > > > +
> > > > +       while (nr_zerod < nr_zero) {
> > > > +
> > > > +               if (zero_should_abort(h, nid) || fatal_signal_pending(current))
> > > > +                       break;
> > > > +
> > > > +               freelist = freelist->prev;
> > > > +               if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
> > > > +                       break;
> > > > +               folio = list_entry(freelist, struct folio, lru);
> > > > +
> > > > +               if (folio_test_hugetlb_zeroed(folio) ||
> > > > +                   folio_test_hugetlb_zeroing(folio))
> > > > +                       continue;
> > > > +
> > > > +               folio_set_hugetlb_zeroing(folio);
> > > > +
> > > > +               /*
> > > > +                * Incrementing this here is a bit of a fib, since
> > > > +                * the page hasn't been cleared yet (it will be done
> > > > +                * immediately after dropping the lock below). But
> > > > +                * it keeps the count consistent with the overall
> > > > +                * free count in case the page gets taken off the
> > > > +                * freelist while we're working on it.
> > > > +                */
> > > > +               h->free_huge_pages_zero_node[nid]++;
> > > > +               spin_unlock_irq(&hugetlb_lock);
> > > > +
> > > > +               /*
> > > > +                * HWPoison pages may show up on the freelist.
> > > > +                * Don't try to zero it out, but do set the flag
> > > > +                * and counts, so that we don't consider it again.
> > > > +                */
> > > > +               if (!folio_test_hwpoison(folio))
> > > > +                       folio_zero_user(folio, 0);
> > > > +
> > > > +               cond_resched();
> > > > +
> > > > +               spin_lock_irq(&hugetlb_lock);
> > > > +               folio_set_hugetlb_zeroed(folio);
> > > > +               folio_clear_hugetlb_zeroing(folio);
> > > > +
> > > > +               /*
> > > > +                * If the page is still on the free list, move
> > > > +                * it to the head.
> > > > +                */
> > > > +               if (folio_test_hugetlb_freed(folio))
> > > > +                       list_move(&folio->lru, &h->hugepage_freelists[nid]);
> > > > +
> > > > +               /*
> > > > +                * If someone was waiting for the zero to
> > > > +                * finish, wake them up.
> > > > +                */
> > > > +               if (waitqueue_active(&h->dqzero_wait[nid]))
> > > > +                       wake_up(&h->dqzero_wait[nid]);
> > > > +               nr_zerod++;
> > > > +               freelist = &h->hugepage_freelists[nid];
> > > > +       }
> > > > +       spin_unlock_irq(&hugetlb_lock);
> > > > +}
> > >
> > > Nit: s/nr_zerod/nr_zeroed/
> >
> > Thank you for the reminder. I will address this issue in v2.
> >
> > > Feels like the list logic can be cleaned up a bit here. Since the
> > > zeroed folios are at the head of the list, and the dirty ones at the
> > > tail, and you start walking from the tail, you don't need to check if
> > > you circled back to the head - just stop if you encounter a prezeroed
> > > folio. If you encounter a prezeroed folio while walking from the tail,
> > > that means that all other folios from that one to the head will also
> > > be prezeroed already.
> >
> > Thank you for the thoughtful suggestion. Your line of reasoning is,
> > in most situations, perfectly valid. Under extreme concurrency,
> > however, a corner case can still appear. Imagine two processes
> > simultaneously zeroing huge pages: Process A enters
> > zero_free_hugepages_nid(), completes the zeroing of one huge page,
> > and marks the folio in the list as pre-zeroed. Should Process B enter
> > the same function moments later and decide to exit as soon as it
> > meets a prezeroed folio, the intended parallel zeroing would quietly
> > fall back to a single-threaded pace.
> 
> Hm, setting the prezeroed bit and moving the folio to the front of the
> free list happens while holding hugetlb_lock. In other words, if you
> encounter a folio with the prezeroed bit set while holding
> hugetlb_lock, it will always be in a contiguous stretch of prezeroed
> folios at the head of the free list.
> 
> Since the check for 'is this already prezeroed' is done while holding
> hugetlb_lock, you know for sure that the folio is part of a list of
> prezeroed folios at the head, and you can stop, right?

Sorry for the confusion earlier. You're right, this does make
zero_free_hugepages_nid() simpler. I'll update it in v2.

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-12-30  2:41 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-25  8:20 [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism 李喆
2025-12-25  8:20 ` [PATCH 1/8] mm/hugetlb: add pre-zeroed framework 李喆
2025-12-26  9:24   ` Raghavendra K T
2025-12-26  9:48     ` Li Zhe
2025-12-25  8:20 ` [PATCH 2/8] mm/hugetlb: convert to prep_account_new_hugetlb_folio() 李喆
2025-12-25  8:20 ` [PATCH 3/8] mm/hugetlb: move the huge folio to the end of the list during enqueue 李喆
2025-12-25  8:20 ` [PATCH 4/8] mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages" 李喆
2025-12-26 18:51   ` Frank van der Linden
2025-12-29 12:25     ` Li Zhe
2025-12-29 18:57       ` Frank van der Linden
2025-12-30  2:41         ` Li Zhe
2025-12-25  8:20 ` [PATCH 5/8] mm/hugetlb: simplify function hugetlb_sysfs_add_hstate() 李喆
2025-12-25  8:20 ` [PATCH 6/8] mm/hugetlb: relocate the per-hstate struct kobject pointer 李喆
2025-12-25  8:20 ` [PATCH 7/8] mm/hugetlb: add epoll support for interface "zeroable_hugepages" 李喆
2025-12-25  8:20 ` [PATCH 8/8] mm/hugetlb: limit event generation frequency of function do_zero_free_notify() 李喆
2025-12-26 18:32 ` [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism Frank van der Linden
2025-12-26 21:42   ` Frank van der Linden
2025-12-29 12:28     ` Li Zhe
2025-12-27  7:21 ` Mateusz Guzik
2025-12-29 12:31   ` Li Zhe
2025-12-28 21:44 ` Andrew Morton
2025-12-29 12:34   ` Li Zhe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox