* [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages
@ 2026-04-12 22:50 Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization
When a guest reports free pages to the hypervisor via virtio-balloon's
free page reporting, the host typically zeros those pages when reclaiming
their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
When the guest later reallocates those pages, the kernel zeros them
again -- redundantly.
This series eliminates that double-zeroing by propagating the "host
already zeroed this page" information through the buddy allocator and
into the page fault path.
Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:
  metric         baseline   optimized   delta
  task-clock     179ms      99ms        -45%
  cache-misses   1.22M      287K        -76%
  instructions   15.1M      13.9M       -8%

With hugetlb surplus pages:

  metric         baseline   optimized   delta
  task-clock     322ms      9.9ms       -97%
  cache-misses   659K       88K         -87%
  instructions   18.3M      10.6M       -42%
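As a quick sanity check on the tables above (an editorial sketch, not part of the series), the rounded delta column can be recomputed from the raw baseline/optimized values with awk, after normalizing units (1.22M = 1220K, etc.):

```shell
# Recompute the rounded "delta" percentages from the raw values.
delta() { awk -v b="$1" -v o="$2" 'BEGIN { printf "%.0f\n", (o - b) / b * 100 }'; }

delta 179 99      # THP task-clock:        -45
delta 1220 287    # THP cache-misses:      -76
delta 15.1 13.9   # THP instructions:      -8
delta 322 9.9     # hugetlb task-clock:    -97
delta 659 88      # hugetlb cache-misses:  -87
delta 18.3 10.6   # hugetlb instructions:  -42
```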
Notes:
- The virtio_balloon patch (9/9) is a testing hack with a module
parameter. A proper virtio feature flag is needed before merging.
- Patch 8/9 adds a sysfs flush trigger for deterministic testing
(avoids waiting for the 2-second reporting delay).
- The optimization is most effective with THP, where entire 2MB
pages are allocated directly from reported order-9+ buddy pages.
Without THP, only ~21% of order-0 allocations come from reported
pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
userspace they return to the hugetlb free pool, not the buddy
allocator, so they are never reported to the host. Surplus
hugetlb pages are allocated from buddy and do benefit.
Test program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
int main(int argc, char **argv)
{
	unsigned long size;
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *p;
	int r;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
		return 1;
	}
	size = atol(argv[1]) * 1024UL * 1024;
	if (argc >= 3 && strcmp(argv[2], "huge") == 0)
		flags |= MAP_HUGETLB;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	r = madvise(p, size, MADV_POPULATE_WRITE);
	if (r) {
		perror("madvise");
		return 1;
	}
	munmap(p, size);
	return 0;
}
Test script (bench.sh):
#!/bin/bash
# Usage: bench.sh <size_mb> <mode> <iterations> [huge]
# mode 0 = baseline, mode 1 = skip zeroing
SZ=${1:-256}; MODE=${2:-0}; ITER=${3:-10}; HUGE=${4:-}
FLUSH=/sys/module/page_reporting/parameters/flush
PERF_DATA=/tmp/perf-$MODE.data
rmmod virtio_balloon 2>/dev/null
insmod virtio_balloon.ko host_zeroes_pages=$MODE
echo 1 > $FLUSH
[ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
rm -f $PERF_DATA
echo "=== sz=${SZ}MB mode=$MODE iter=$ITER $HUGE ==="
for i in $(seq 1 $ITER); do
	echo 3 > /proc/sys/vm/drop_caches
	echo 1 > $FLUSH
	perf stat record -e task-clock,instructions,cache-misses \
		-o $PERF_DATA --append -- ./alloc_once $SZ $HUGE
done
[ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
rmmod virtio_balloon
perf stat report -i $PERF_DATA
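One detail of bench.sh worth spelling out (an editorial note): the nr_overcommit_hugepages value of $((SZ/2)) assumes the default 2MB hugepage size, converting the allocation size in MB into a page count:

```shell
# SZ megabytes of surplus hugetlb memory needs SZ/2 pages of 2MB each.
SZ=256
PAGES=$((SZ / 2))
echo "$PAGES pages x 2MB = ${SZ}MB"
```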
Compile and run:
gcc -static -O2 -o alloc_once alloc_once.c
bash bench.sh 256 0 10 # baseline (regular pages)
bash bench.sh 256 1 10 # optimized (regular pages)
bash bench.sh 256 0 10 huge # baseline (hugetlb surplus)
bash bench.sh 256 1 10 huge # optimized (hugetlb surplus)
Written with assistance from Claude. Everything was read manually; the
patchset was split and the commit logs edited by hand.
Michael S. Tsirkin (9):
mm: page_alloc: propagate PageReported flag across buddy splits
mm: page_reporting: skip redundant zeroing of host-zeroed reported
pages
mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed
pages
mm: skip zeroing in alloc_anon_folio for pre-zeroed pages
mm: skip zeroing in vma_alloc_anon_folio_pmd for pre-zeroed pages
mm: hugetlb: skip zeroing of pre-zeroed hugetlb pages
mm: page_reporting: add flush parameter to trigger immediate reporting
virtio_balloon: a hack to enable host-zeroed page optimization
drivers/virtio/virtio_balloon.c | 7 +++++
fs/hugetlbfs/inode.c | 3 ++-
include/linux/gfp_types.h | 5 ++++
include/linux/highmem.h | 6 +++--
include/linux/hugetlb.h | 2 +-
include/linux/mm.h | 22 ++++++++++++++++
include/linux/page_reporting.h | 3 +++
mm/huge_memory.c | 4 +--
mm/hugetlb.c | 3 ++-
mm/memory.c | 5 ++--
mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++---
mm/page_reporting.c | 34 ++++++++++++++++++++++++
mm/page_reporting.h | 2 ++
13 files changed, 129 insertions(+), 13 deletions(-)
--
MST
* [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-04-12 22:50 ` Michael S. Tsirkin
2026-04-13 19:11 ` David Hildenbrand (Arm)
2026-04-12 22:50 ` [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (7 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Johannes Weiner,
Zi Yan
When a reported free page is split via expand() to satisfy a
smaller allocation, the sub-pages placed back on the free lists
lose the PageReported flag. This means they will be unnecessarily
re-reported to the hypervisor in the next reporting cycle, wasting
work.
Propagate the PageReported flag to sub-pages during expand() so
that they are recognized as already-reported.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/page_alloc.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d4b6f1a554e..edbb1edf463d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1730,7 +1730,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
* -- nyc
*/
static inline unsigned int expand(struct zone *zone, struct page *page, int low,
- int high, int migratetype)
+ int high, int migratetype, bool reported)
{
unsigned int size = 1 << high;
unsigned int nr_added = 0;
@@ -1752,6 +1752,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
__add_to_free_list(&page[size], zone, high, migratetype, false);
set_buddy_order(&page[size], high);
nr_added += size;
+
+ /*
+ * The parent page has been reported to the host. The
+ * sub-pages are part of the same reported block, so mark
+ * them reported too. This avoids re-reporting pages that
+ * the host already knows about.
+ */
+ if (reported)
+ __SetPageReported(&page[size]);
}
return nr_added;
@@ -1762,9 +1771,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int high, int migratetype)
{
int nr_pages = 1 << high;
+ bool was_reported = page_reported(page);
__del_page_from_free_list(page, zone, high, migratetype);
- nr_pages -= expand(zone, page, low, high, migratetype);
+ nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -2322,7 +2332,8 @@ try_to_claim_block(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, current_order, block_type);
change_pageblock_range(page, current_order, start_type);
- nr_added = expand(zone, page, order, current_order, start_type);
+ nr_added = expand(zone, page, order, current_order, start_type,
+ false);
account_freepages(zone, nr_added, start_type);
return page;
}
--
MST
* [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
@ 2026-04-12 22:50 ` Michael S. Tsirkin
2026-04-13 8:00 ` David Hildenbrand (Arm)
2026-04-12 22:50 ` [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed() Michael S. Tsirkin
` (6 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, Johannes Weiner, Zi Yan
When a guest reports free pages to the hypervisor via the page reporting
framework (used by virtio-balloon and hv_balloon), the host typically
zeros those pages when reclaiming their backing memory. However, when
those pages are later allocated in the guest, post_alloc_hook()
unconditionally zeros them again if __GFP_ZERO is set. This
double-zeroing is wasteful, especially for large pages.
Avoid redundant zeroing by propagating the "host already zeroed this"
information through the allocation path:
1. Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
drivers to declare that their host zeros reported pages on reclaim.
A static key (page_reporting_host_zeroes) gates the fast path.
2. In page_del_and_expand(), when the page was reported and the
static key is enabled, stash a sentinel value (MAGIC_PAGE_ZEROED)
in page->private.
3. In post_alloc_hook(), check page->private for the sentinel. If
present and zeroing was requested (but not tag zeroing), skip
kernel_init_pages().
In particular, __GFP_ZERO is used by the x86 arch override of
vma_alloc_zeroed_movable_folio.
No driver sets host_zeroes_pages yet; a follow-up patch to
virtio_balloon is needed to opt in.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/mm.h | 6 ++++++
include/linux/page_reporting.h | 3 +++
mm/page_alloc.c | 21 +++++++++++++++++++++
mm/page_reporting.c | 9 +++++++++
mm/page_reporting.h | 2 ++
5 files changed, 41 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5be3d8a8f806..59fc77c4c90e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4814,6 +4814,12 @@ static inline bool user_alloc_needs_zeroing(void)
&init_on_alloc);
}
+/*
+ * Sentinel stored in page->private to indicate the page was pre-zeroed
+ * by the hypervisor (via free page reporting).
+ */
+#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
+
int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status);
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index fe648dfa3a7c..10faadfeb4fb 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -13,6 +13,9 @@ struct page_reporting_dev_info {
int (*report)(struct page_reporting_dev_info *prdev,
struct scatterlist *sg, unsigned int nents);
+ /* If true, host zeros reported pages on reclaim */
+ bool host_zeroes_pages;
+
/* work struct for processing reports */
struct delayed_work work;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edbb1edf463d..efb65eee826b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1774,8 +1774,20 @@ static __always_inline void page_del_and_expand(struct zone *zone,
bool was_reported = page_reported(page);
__del_page_from_free_list(page, zone, high, migratetype);
+
+ was_reported = was_reported &&
+ static_branch_unlikely(&page_reporting_host_zeroes);
+
nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
account_freepages(zone, -nr_pages, migratetype);
+
+ /*
+ * If the page was reported and the host is known to zero reported
+ * pages, mark it zeroed via page->private so that
+ * post_alloc_hook() can skip redundant zeroing.
+ */
+ if (was_reported)
+ set_page_private(page, MAGIC_PAGE_ZEROED);
}
static void check_new_page_bad(struct page *page)
@@ -1851,11 +1863,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
+ bool prezeroed = page_private(page) == MAGIC_PAGE_ZEROED;
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
int i;
set_page_private(page, 0);
+ /*
+ * If the page is pre-zeroed, skip memory initialization.
+ * We still need to handle tag zeroing separately since the host
+ * does not know about memory tags.
+ */
+ if (prezeroed && init && !zero_tags)
+ init = false;
+
arch_alloc_page(page, order);
debug_pagealloc_map_pages(page, 1 << order);
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index f0042d5743af..cb24832bdf4e 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -50,6 +50,8 @@ EXPORT_SYMBOL_GPL(page_reporting_order);
#define PAGE_REPORTING_DELAY (2 * HZ)
static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+DEFINE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
+
enum {
PAGE_REPORTING_IDLE = 0,
PAGE_REPORTING_REQUESTED,
@@ -386,6 +388,10 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
/* Assign device to allow notifications */
rcu_assign_pointer(pr_dev_info, prdev);
+ /* enable zeroed page optimization if host zeroes reported pages */
+ if (prdev->host_zeroes_pages)
+ static_branch_enable(&page_reporting_host_zeroes);
+
/* enable page reporting notification */
if (!static_key_enabled(&page_reporting_enabled)) {
static_branch_enable(&page_reporting_enabled);
@@ -410,6 +416,9 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
/* Flush any existing work, and lock it out */
cancel_delayed_work_sync(&prdev->work);
+
+ if (prdev->host_zeroes_pages)
+ static_branch_disable(&page_reporting_host_zeroes);
}
mutex_unlock(&page_reporting_mutex);
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index c51dbc228b94..2bbf99f456f5 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -15,6 +15,8 @@ DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
extern unsigned int page_reporting_order;
void __page_reporting_notify(void);
+DECLARE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
+
static inline bool page_reported(struct page *page)
{
return static_branch_unlikely(&page_reporting_enabled) &&
--
MST
* [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-04-12 22:50 ` Michael S. Tsirkin
2026-04-13 9:05 ` David Hildenbrand (Arm)
2026-04-12 22:50 ` [PATCH RFC 4/9] mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed pages Michael S. Tsirkin
` (5 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, Johannes Weiner, Zi Yan
The previous patch skips zeroing in post_alloc_hook() when
__GFP_ZERO is used. However, several page allocation paths
zero pages via folio_zero_user() or clear_user_highpage() after
allocation, not via __GFP_ZERO.
Add __GFP_PREZEROED gfp flag that tells post_alloc_hook() to
preserve the MAGIC_PAGE_ZEROED sentinel in page->private so the
caller can detect pre-zeroed pages and skip its own zeroing.
Add folio_test_clear_prezeroed() helper to check and clear
the sentinel.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/gfp_types.h | 5 +++++
include/linux/mm.h | 16 ++++++++++++++++
mm/page_alloc.c | 8 +++++++-
3 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6c75df30a281..903f87c7fec9 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -56,6 +56,7 @@ enum {
___GFP_NOLOCKDEP_BIT,
#endif
___GFP_NO_OBJ_EXT_BIT,
+ ___GFP_PREZEROED_BIT,
___GFP_LAST_BIT
};
@@ -97,6 +98,7 @@ enum {
#define ___GFP_NOLOCKDEP 0
#endif
#define ___GFP_NO_OBJ_EXT BIT(___GFP_NO_OBJ_EXT_BIT)
+#define ___GFP_PREZEROED BIT(___GFP_PREZEROED_BIT)
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -292,6 +294,9 @@ enum {
#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
+/* Caller handles pre-zeroed pages; preserve MAGIC_PAGE_ZEROED in private */
+#define __GFP_PREZEROED ((__force gfp_t)___GFP_PREZEROED)
+
/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 59fc77c4c90e..caa1de31bbca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4820,6 +4820,22 @@ static inline bool user_alloc_needs_zeroing(void)
*/
#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
+/**
+ * folio_test_clear_prezeroed - test and clear the pre-zeroed marker.
+ * @folio: the folio to test.
+ *
+ * Returns true if the folio was pre-zeroed by the host, and clears
+ * the marker. Callers can skip their own zeroing.
+ */
+static inline bool folio_test_clear_prezeroed(struct folio *folio)
+{
+ if (page_private(&folio->page) == MAGIC_PAGE_ZEROED) {
+ set_page_private(&folio->page, 0);
+ return true;
+ }
+ return false;
+}
+
int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status);
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index efb65eee826b..fba8321c45ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1867,7 +1867,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
int i;
- set_page_private(page, 0);
+ /*
+ * If the page is pre-zeroed and the caller opted in via
+ * __GFP_PREZEROED, preserve the marker so the caller can
+ * skip its own zeroing. Otherwise always clear private.
+ */
+ if (!(prezeroed && (gfp_flags & __GFP_PREZEROED)))
+ set_page_private(page, 0);
/*
* If the page is pre-zeroed, skip memory initialization.
--
MST
* [PATCH RFC 4/9] mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed pages
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (2 preceding siblings ...)
2026-04-12 22:50 ` [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed() Michael S. Tsirkin
@ 2026-04-12 22:50 ` Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 5/9] mm: skip zeroing in alloc_anon_folio " Michael S. Tsirkin
` (4 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport
Use __GFP_PREZEROED and folio_test_clear_prezeroed() to skip
clear_user_highpage() when the page is already zeroed.
On x86, vma_alloc_zeroed_movable_folio is overridden by a macro
that uses __GFP_ZERO directly, so this change has no effect there.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/highmem.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..b649e7e315f4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -322,8 +322,10 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
{
struct folio *folio;
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr);
- if (folio && user_alloc_needs_zeroing())
+ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_PREZEROED,
+ 0, vma, vaddr);
+ if (folio && user_alloc_needs_zeroing() &&
+ !folio_test_clear_prezeroed(folio))
clear_user_highpage(&folio->page, vaddr);
return folio;
--
MST
* [PATCH RFC 5/9] mm: skip zeroing in alloc_anon_folio for pre-zeroed pages
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (3 preceding siblings ...)
2026-04-12 22:50 ` [PATCH RFC 4/9] mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed pages Michael S. Tsirkin
@ 2026-04-12 22:50 ` Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 6/9] mm: skip zeroing in vma_alloc_anon_folio_pmd " Michael S. Tsirkin
` (3 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport
Use __GFP_PREZEROED and folio_test_clear_prezeroed() to skip
folio_zero_user() in the mTHP anonymous page allocation path
when the page is already zeroed.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/memory.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 07778814b4a8..2f61321a81fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5176,7 +5176,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
goto fallback;
/* Try allocating the highest of the remaining orders. */
- gfp = vma_thp_gfp_mask(vma);
+ gfp = vma_thp_gfp_mask(vma) | __GFP_PREZEROED;
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
folio = vma_alloc_folio(gfp, order, vma, addr);
@@ -5194,7 +5194,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
* that the page corresponding to the faulting address
* will be hot in the cache after zeroing.
*/
- if (user_alloc_needs_zeroing())
+ if (user_alloc_needs_zeroing() &&
+ !folio_test_clear_prezeroed(folio))
folio_zero_user(folio, vmf->address);
return folio;
}
--
MST
* [PATCH RFC 6/9] mm: skip zeroing in vma_alloc_anon_folio_pmd for pre-zeroed pages
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (4 preceding siblings ...)
2026-04-12 22:50 ` [PATCH RFC 5/9] mm: skip zeroing in alloc_anon_folio " Michael S. Tsirkin
@ 2026-04-12 22:50 ` Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 7/9] mm: hugetlb: skip zeroing of pre-zeroed hugetlb pages Michael S. Tsirkin
` (2 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:50 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang
Use __GFP_PREZEROED and folio_test_clear_prezeroed() to skip
folio_zero_user() in the PMD THP anonymous page allocation path
when the page is already zeroed.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/huge_memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74ad..3b9b53fad0f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1256,7 +1256,7 @@ EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
unsigned long addr)
{
- gfp_t gfp = vma_thp_gfp_mask(vma);
+ gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_PREZEROED;
const int order = HPAGE_PMD_ORDER;
struct folio *folio;
@@ -1285,7 +1285,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
* make sure that the page corresponding to the faulting address will be
* hot in the cache after zeroing.
*/
- if (user_alloc_needs_zeroing())
+ if (user_alloc_needs_zeroing() && !folio_test_clear_prezeroed(folio))
folio_zero_user(folio, addr);
/*
* The memory barrier inside __folio_mark_uptodate makes sure that
--
MST
* [PATCH RFC 7/9] mm: hugetlb: skip zeroing of pre-zeroed hugetlb pages
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (5 preceding siblings ...)
2026-04-12 22:50 ` [PATCH RFC 6/9] mm: skip zeroing in vma_alloc_anon_folio_pmd " Michael S. Tsirkin
@ 2026-04-12 22:51 ` Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 8/9] mm: page_reporting: add flush parameter to trigger immediate reporting Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 9/9] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin
8 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:51 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Muchun Song,
Oscar Salvador
When a surplus hugetlb page is allocated from the buddy allocator
and the page was previously reported to the host (and zeroed on
reclaim), skip the redundant folio_zero_user() in the hugetlb
fault path.
This only benefits surplus hugetlb pages that are freshly allocated
from the buddy. Pages from the persistent hugetlb pool are not
affected since they are not allocated from buddy at fault time.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
fs/hugetlbfs/inode.c | 3 ++-
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 3 ++-
3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3f70c47981de..301567ad160f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -828,7 +828,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(folio);
goto out;
}
- folio_zero_user(folio, addr);
+ if (!folio_test_clear_prezeroed(folio))
+ folio_zero_user(folio, addr);
__folio_mark_uptodate(folio);
error = hugetlb_add_to_page_cache(folio, mapping, index);
if (unlikely(error)) {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 65910437be1c..07e3ef8c0418 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -937,7 +937,7 @@ static inline bool hugepage_movable_supported(struct hstate *h)
/* Movability of hugepages depends on migration support. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
- gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
+ gfp_t gfp = __GFP_COMP | __GFP_NOWARN | __GFP_PREZEROED;
gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0beb6e22bc26..5b23b006c37c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5809,7 +5809,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
ret = 0;
goto out;
}
- folio_zero_user(folio, vmf->real_address);
+ if (!folio_test_clear_prezeroed(folio))
+ folio_zero_user(folio, vmf->real_address);
__folio_mark_uptodate(folio);
new_folio = true;
--
MST
* [PATCH RFC 8/9] mm: page_reporting: add flush parameter to trigger immediate reporting
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (6 preceding siblings ...)
2026-04-12 22:51 ` [PATCH RFC 7/9] mm: hugetlb: skip zeroing of pre-zeroed hugetlb pages Michael S. Tsirkin
@ 2026-04-12 22:51 ` Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 9/9] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin
8 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:51 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Johannes Weiner,
Zi Yan
Add a write-only module parameter 'flush' that triggers an immediate
page reporting cycle. Writing any value flushes pending work and
runs one cycle synchronously.
This is useful for testing and benchmarking the pre-zeroed page
optimization, where the reporting delay (2 seconds) makes it hard
to ensure pages are reported before measuring allocation performance.
echo 1 > /sys/module/page_reporting/parameters/flush
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/page_reporting.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index cb24832bdf4e..e9a2186e4c48 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -351,6 +351,31 @@ static void page_reporting_process(struct work_struct *work)
static DEFINE_MUTEX(page_reporting_mutex);
DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);
+static int page_reporting_flush_set(const char *val,
+ const struct kernel_param *kp)
+{
+ struct page_reporting_dev_info *prdev;
+
+ mutex_lock(&page_reporting_mutex);
+ prdev = rcu_dereference_protected(pr_dev_info,
+ lockdep_is_held(&page_reporting_mutex));
+ if (prdev) {
+ flush_delayed_work(&prdev->work);
+ __page_reporting_request(prdev);
+ flush_delayed_work(&prdev->work);
+ }
+ mutex_unlock(&page_reporting_mutex);
+ return 0;
+}
+
+static const struct kernel_param_ops flush_ops = {
+ .set = page_reporting_flush_set,
+ .get = param_get_uint,
+};
+static unsigned int page_reporting_flush;
+module_param_cb(flush, &flush_ops, &page_reporting_flush, 0200);
+MODULE_PARM_DESC(flush, "Trigger immediate page reporting cycle");
+
int page_reporting_register(struct page_reporting_dev_info *prdev)
{
int err = 0;
--
MST
* [PATCH RFC 9/9] virtio_balloon: a hack to enable host-zeroed page optimization
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (7 preceding siblings ...)
2026-04-12 22:51 ` [PATCH RFC 8/9] mm: page_reporting: add flush parameter to trigger immediate reporting Michael S. Tsirkin
@ 2026-04-12 22:51 ` Michael S. Tsirkin
8 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-12 22:51 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, linux-mm, virtualization, Xuan Zhuo,
Eugenio Pérez
Add a module parameter host_zeroes_pages to opt in to the pre-zeroed
page optimization. A proper virtio feature flag is needed before
this can be merged.
insmod virtio_balloon.ko host_zeroes_pages=1
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
drivers/virtio/virtio_balloon.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d1fbc8fe8470..5d37196daa75 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,11 @@
#include <linux/mm.h>
#include <linux/page_reporting.h>
+static bool host_zeroes_pages;
+module_param(host_zeroes_pages, bool, 0644);
+MODULE_PARM_DESC(host_zeroes_pages,
+ "Host zeroes reported pages, skip guest re-zeroing");
+
/*
* Balloon device works in 4K page units. So each page is pointed to by
* multiple balloon pages. All memory counters in this driver are in balloon
@@ -1039,6 +1044,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->pr_dev_info.order = 5;
#endif
+ /* TODO: needs a virtio feature flag */
+ vb->pr_dev_info.host_zeroes_pages = host_zeroes_pages;
err = page_reporting_register(&vb->pr_dev_info);
if (err)
goto out_unregister_oom;
--
MST
* Re: [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-12 22:50 ` [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-04-13 8:00 ` David Hildenbrand (Arm)
2026-04-13 8:10 ` Michael S. Tsirkin
2026-04-13 20:35 ` Michael S. Tsirkin
0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-13 8:00 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel
Cc: Andrew Morton, Vlastimil Babka, Brendan Jackman, Michal Hocko,
Suren Baghdasaryan, Jason Wang, Andrea Arcangeli, linux-mm,
virtualization, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Johannes Weiner, Zi Yan
On 4/13/26 00:50, Michael S. Tsirkin wrote:
> When a guest reports free pages to the hypervisor via the page reporting
> framework (used by virtio-balloon and hv_balloon), the host typically
> zeros those pages when reclaiming their backing memory. However, when
> those pages are later allocated in the guest, post_alloc_hook()
> unconditionally zeros them again if __GFP_ZERO is set. This
> double-zeroing is wasteful, especially for large pages.
>
> Avoid redundant zeroing by propagating the "host already zeroed this"
> information through the allocation path:
>
> 1. Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
> drivers to declare that their host zeros reported pages on reclaim.
> A static key (page_reporting_host_zeroes) gates the fast path.
>
> 2. In page_del_and_expand(), when the page was reported and the
> static key is enabled, stash a sentinel value (MAGIC_PAGE_ZEROED)
> in page->private.
>
> 3. In post_alloc_hook(), check page->private for the sentinel. If
> present and zeroing was requested (but not tag zeroing), skip
> kernel_init_pages().
>
> In particular, __GFP_ZERO is used by the x86 arch override of
> vma_alloc_zeroed_movable_folio.
>
> No driver sets host_zeroes_pages yet; a follow-up patch to
> virtio_balloon is needed to opt in.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
> include/linux/mm.h | 6 ++++++
> include/linux/page_reporting.h | 3 +++
> mm/page_alloc.c | 21 +++++++++++++++++++++
> mm/page_reporting.c | 9 +++++++++
> mm/page_reporting.h | 2 ++
> 5 files changed, 41 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5be3d8a8f806..59fc77c4c90e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4814,6 +4814,12 @@ static inline bool user_alloc_needs_zeroing(void)
> &init_on_alloc);
> }
>
> +/*
> + * Sentinel stored in page->private to indicate the page was pre-zeroed
> + * by the hypervisor (via free page reporting).
> + */
> +#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
Why are we not using another page flag that is yet unused for buddy pages?
Using page->private for that, and exposing it to buddy users with the
__GFP_PREZEROED flag (I hope we can avoid that) does not sound
particularly elegant.
Also, if we're going to remember that some pages in the buddy are
pre-zeroed, it should better not be free-page-reporting specific.
I'd assume ordinary inflating+deflating of the balloon would also end up
with pre-zeroed pages. We'd just need a (mm/balloon.c -specific)
interface to tell the buddy that the pages are zeroed.
--
Cheers,
David
* Re: [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-13 8:00 ` David Hildenbrand (Arm)
@ 2026-04-13 8:10 ` Michael S. Tsirkin
2026-04-13 8:15 ` David Hildenbrand (Arm)
2026-04-13 20:35 ` Michael S. Tsirkin
1 sibling, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 8:10 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 10:00:58AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 00:50, Michael S. Tsirkin wrote:
> > When a guest reports free pages to the hypervisor via the page reporting
> > framework (used by virtio-balloon and hv_balloon), the host typically
> > zeros those pages when reclaiming their backing memory. However, when
> > those pages are later allocated in the guest, post_alloc_hook()
> > unconditionally zeros them again if __GFP_ZERO is set. This
> > double-zeroing is wasteful, especially for large pages.
> >
> > Avoid redundant zeroing by propagating the "host already zeroed this"
> > information through the allocation path:
> >
> > 1. Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
> > drivers to declare that their host zeros reported pages on reclaim.
> > A static key (page_reporting_host_zeroes) gates the fast path.
> >
> > 2. In page_del_and_expand(), when the page was reported and the
> > static key is enabled, stash a sentinel value (MAGIC_PAGE_ZEROED)
> > in page->private.
> >
> > 3. In post_alloc_hook(), check page->private for the sentinel. If
> > present and zeroing was requested (but not tag zeroing), skip
> > kernel_init_pages().
> >
> > In particular, __GFP_ZERO is used by the x86 arch override of
> > vma_alloc_zeroed_movable_folio.
> >
> > No driver sets host_zeroes_pages yet; a follow-up patch to
> > virtio_balloon is needed to opt in.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> > include/linux/mm.h | 6 ++++++
> > include/linux/page_reporting.h | 3 +++
> > mm/page_alloc.c | 21 +++++++++++++++++++++
> > mm/page_reporting.c | 9 +++++++++
> > mm/page_reporting.h | 2 ++
> > 5 files changed, 41 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 5be3d8a8f806..59fc77c4c90e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4814,6 +4814,12 @@ static inline bool user_alloc_needs_zeroing(void)
> > &init_on_alloc);
> > }
> >
> > +/*
> > + * Sentinel stored in page->private to indicate the page was pre-zeroed
> > + * by the hypervisor (via free page reporting).
> > + */
> > +#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
>
> Why are we not using another page flag that is yet unused for buddy pages?
Because we need to report the status *after* it has left the buddy,
and all page flags are in use at that point.
> Using page->private for that, and exposing it to buddy users with the
> __GFP_PREZEROED flag (I hope we can avoid that) does not sound
> particularly elegant.
But propagating this all over mm does not sound too palatable, right?
There's precedent with MAGIC_HWPOISON already.
Better ideas? Thanks!
> Also, if we're going to remember that some pages in the buddy are
> pre-zeroed, it should better not be free-page-reporting specific.
> I'd assume ordinary inflating+deflating of the balloon would also end up
> with pre-zeroed pages. We'd just need a (mm/balloon.c -specific)
> interface to tell the buddy that the pages are zeroed.
>
Indeed, it's also easily possible - it's a separate optimization, though.
Another simple enhancement is including hugetlbfs freelists in page
reporting.
Doesn't need to block this patchset though, right?
>
> --
> Cheers,
>
> David
* Re: [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-13 8:10 ` Michael S. Tsirkin
@ 2026-04-13 8:15 ` David Hildenbrand (Arm)
2026-04-13 8:29 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-13 8:15 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On 4/13/26 10:10, Michael S. Tsirkin wrote:
> On Mon, Apr 13, 2026 at 10:00:58AM +0200, David Hildenbrand (Arm) wrote:
>> On 4/13/26 00:50, Michael S. Tsirkin wrote:
>>> When a guest reports free pages to the hypervisor via the page reporting
>>> framework (used by virtio-balloon and hv_balloon), the host typically
>>> zeros those pages when reclaiming their backing memory. However, when
>>> those pages are later allocated in the guest, post_alloc_hook()
>>> unconditionally zeros them again if __GFP_ZERO is set. This
>>> double-zeroing is wasteful, especially for large pages.
>>>
>>> Avoid redundant zeroing by propagating the "host already zeroed this"
>>> information through the allocation path:
>>>
>>> 1. Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
>>> drivers to declare that their host zeros reported pages on reclaim.
>>> A static key (page_reporting_host_zeroes) gates the fast path.
>>>
>>> 2. In page_del_and_expand(), when the page was reported and the
>>> static key is enabled, stash a sentinel value (MAGIC_PAGE_ZEROED)
>>> in page->private.
>>>
>>> 3. In post_alloc_hook(), check page->private for the sentinel. If
>>> present and zeroing was requested (but not tag zeroing), skip
>>> kernel_init_pages().
>>>
>>> In particular, __GFP_ZERO is used by the x86 arch override of
>>> vma_alloc_zeroed_movable_folio.
>>>
>>> No driver sets host_zeroes_pages yet; a follow-up patch to
>>> virtio_balloon is needed to opt in.
>>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> Assisted-by: Claude:claude-opus-4-6
>>> ---
>>> include/linux/mm.h | 6 ++++++
>>> include/linux/page_reporting.h | 3 +++
>>> mm/page_alloc.c | 21 +++++++++++++++++++++
>>> mm/page_reporting.c | 9 +++++++++
>>> mm/page_reporting.h | 2 ++
>>> 5 files changed, 41 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 5be3d8a8f806..59fc77c4c90e 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -4814,6 +4814,12 @@ static inline bool user_alloc_needs_zeroing(void)
>>> &init_on_alloc);
>>> }
>>>
>>> +/*
>>> + * Sentinel stored in page->private to indicate the page was pre-zeroed
>>> + * by the hypervisor (via free page reporting).
>>> + */
>>> +#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
>>
>> Why are we not using another page flag that is yet unused for buddy pages?
>
> Because we need to report the status *after* it left buddy.
> And all flags are in use at that point.
I'll comment on that on the other patch, where __GFP_PREZEROED, which I
really hate, is added.
>
>
>> Using page->private for that, and exposing it to buddy users with the
>> __GFP_PREZEROED flag (I hope we can avoid that) does not sound
>> particularly elegant.
>
> But propagating this all over mm does not sound too palatable, right?
> There's precedent with MAGIC_HWPOISON already.
> Better ideas? Thanks!
I'll comment on the __GFP_PREZEROED patch.
>
>> Also, if we're going to remember that some pages in the buddy are
>> pre-zeroed, it should better not be free-page-reporting specific.
>> I'd assume ordinary inflating+deflating of the balloon would also end up
>> with pre-zeroed pages. We'd just need a (mm/balloon.c -specific)
>> interface to tell the buddy that the pages are zeroed.
>>
>
> Indeed, it's also easily possible - it's a separate optimization, though.
> Another simple enhancement is including hugetlbfs freelists in page
> reporting.
> Doesn't need to block this patchset though, right?
Not blocking, but I don't want something that is too coupled to
free-page reporting optimizations in the buddy. The comment above
MAGIC_PAGE_ZEROED triggered my reaction.
--
Cheers,
David
* Re: [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-13 8:15 ` David Hildenbrand (Arm)
@ 2026-04-13 8:29 ` Michael S. Tsirkin
0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 8:29 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 10:15:08AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 10:10, Michael S. Tsirkin wrote:
> > On Mon, Apr 13, 2026 at 10:00:58AM +0200, David Hildenbrand (Arm) wrote:
> >> On 4/13/26 00:50, Michael S. Tsirkin wrote:
> >>> When a guest reports free pages to the hypervisor via the page reporting
> >>> framework (used by virtio-balloon and hv_balloon), the host typically
> >>> zeros those pages when reclaiming their backing memory. However, when
> >>> those pages are later allocated in the guest, post_alloc_hook()
> >>> unconditionally zeros them again if __GFP_ZERO is set. This
> >>> double-zeroing is wasteful, especially for large pages.
> >>>
> >>> Avoid redundant zeroing by propagating the "host already zeroed this"
> >>> information through the allocation path:
> >>>
> >>> 1. Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
> >>> drivers to declare that their host zeros reported pages on reclaim.
> >>> A static key (page_reporting_host_zeroes) gates the fast path.
> >>>
> >>> 2. In page_del_and_expand(), when the page was reported and the
> >>> static key is enabled, stash a sentinel value (MAGIC_PAGE_ZEROED)
> >>> in page->private.
> >>>
> >>> 3. In post_alloc_hook(), check page->private for the sentinel. If
> >>> present and zeroing was requested (but not tag zeroing), skip
> >>> kernel_init_pages().
> >>>
> >>> In particular, __GFP_ZERO is used by the x86 arch override of
> >>> vma_alloc_zeroed_movable_folio.
> >>>
> >>> No driver sets host_zeroes_pages yet; a follow-up patch to
> >>> virtio_balloon is needed to opt in.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>> Assisted-by: Claude:claude-opus-4-6
> >>> ---
> >>> include/linux/mm.h | 6 ++++++
> >>> include/linux/page_reporting.h | 3 +++
> >>> mm/page_alloc.c | 21 +++++++++++++++++++++
> >>> mm/page_reporting.c | 9 +++++++++
> >>> mm/page_reporting.h | 2 ++
> >>> 5 files changed, 41 insertions(+)
> >>>
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index 5be3d8a8f806..59fc77c4c90e 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -4814,6 +4814,12 @@ static inline bool user_alloc_needs_zeroing(void)
> >>> &init_on_alloc);
> >>> }
> >>>
> >>> +/*
> >>> + * Sentinel stored in page->private to indicate the page was pre-zeroed
> >>> + * by the hypervisor (via free page reporting).
> >>> + */
> >>> +#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
> >>
> >> Why are we not using another page flag that is yet unused for buddy pages?
> >
> > Because we need to report the status *after* it left buddy.
> > And all flags are in use at that point.
>
> I'll comment on that on the other patch, where __GFP_PREZEROED, which I
> really hate, is added.
>
> >
> >
> >> Using page->private for that, and exposing it to buddy users with the
> >> __GFP_PREZEROED flag (I hope we can avoid that) does not sound
> >> particularly elegant.
> >
> > But propagating this all over mm does not sound too palatable, right?
> > There's precedent with MAGIC_HWPOISON already.
> > Better ideas? Thanks!
>
> I'll comment on the __GFP_PREZEROED patch.
>
> >
> >> Also, if we're going to remember that some pages in the buddy are
> >> pre-zeroed, it should better not be free-page-reporting specific.
> >> I'd assume ordinary inflating+deflating of the balloon would also end up
> >> with pre-zeroed pages. We'd just need a (mm/balloon.c -specific)
> >> interface to tell the buddy that the pages are zeroed.
> >>
> >
> > Indeed, it's also easily possible - it's a separate optimization, though.
> > Another simple enhancement is including hugetlbfs freelists in page
> > reporting.
> > Doesn't need to block this patchset though, right?
>
> Not blocking, but I don't want something that is too coupled to
> free-page reporting optimizations in the buddy.
I can add that in the next version if you like, sure. The main issue is
that it means we need a flag that survives free. And the benefit is
much smaller - unlike page reports, deflates are rare.
> The comment above
> MAGIC_PAGE_ZEROED triggered my reaction.
yea, that's more confusing than helpful.
> --
> Cheers,
>
> David
* Re: [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
2026-04-12 22:50 ` [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed() Michael S. Tsirkin
@ 2026-04-13 9:05 ` David Hildenbrand (Arm)
2026-04-13 20:37 ` Michael S. Tsirkin
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-13 9:05 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel
Cc: Andrew Morton, Vlastimil Babka, Brendan Jackman, Michal Hocko,
Suren Baghdasaryan, Jason Wang, Andrea Arcangeli, linux-mm,
virtualization, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Johannes Weiner, Zi Yan
On 4/13/26 00:50, Michael S. Tsirkin wrote:
> The previous patch skips zeroing in post_alloc_hook() when
> __GFP_ZERO is used. However, several page allocation paths
> zero pages via folio_zero_user() or clear_user_highpage() after
> allocation, not via __GFP_ZERO.
>
> Add __GFP_PREZEROED gfp flag that tells post_alloc_hook() to
> preserve the MAGIC_PAGE_ZEROED sentinel in page->private so the
> caller can detect pre-zeroed pages and skip its own zeroing.
> Add folio_test_clear_prezeroed() helper to check and clear
> the sentinel.
I really don't like __GFP_PREZEROED, and wonder how we can avoid it.
What you want is to allocate a folio (well, actually a page that becomes
a folio) and know whether zeroing for that folio (once we establish it
from a page) is still required.
Or you just allocate a folio, specify __GFP_ZERO, and let the folio
allocation code deal with that.
I think we have two options:
(1) Use an indication that can be sticky for callers that do not care.
Assuming we would use a page flag that is only ever used on folios, all
we'd have to do is make sure that we clear the flag once we convert
the page to a folio.
For example, PG_dropbehind is only ever set on folios in the pagecache.
Paths that allocate folios would have to clear the flag. For non-hugetlb
folios that happens through page_rmappable_folio().
I'm not super-happy about that, but it would be doable.
(2) Use a dedicated allocation interface for user pages in the buddy.
I hate the whole user_alloc_needs_zeroing()+folio_zero_user() handling.
It shouldn't exist. We should just be passing __GFP_ZERO and let the buddy
handle all that.
For example, vma_alloc_folio() already gets passed the address in.
Pass the address from vma_alloc_folio_noprof()->folio_alloc_noprof(), and let
folio_alloc_noprof() use a buddy interface that can handle it.
Imagine if we had an alloc_user_pages_noprof() that consumes an address. It could just
do what folio_zero_user() does, and only if really required.
The whole user_alloc_needs_zeroing() could go away and you could just handle the
pre-zeroed optimization internally.
--
Cheers,
David
* Re: [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits
2026-04-12 22:50 ` [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
@ 2026-04-13 19:11 ` David Hildenbrand (Arm)
2026-04-13 20:32 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-13 19:11 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel
Cc: Andrew Morton, Vlastimil Babka, Brendan Jackman, Michal Hocko,
Suren Baghdasaryan, Jason Wang, Andrea Arcangeli, linux-mm,
virtualization, Johannes Weiner, Zi Yan
On 4/13/26 00:50, Michael S. Tsirkin wrote:
> When a reported free page is split via expand() to satisfy a
> smaller allocation, the sub-pages placed back on the free lists
> lose the PageReported flag. This means they will be unnecessarily
> re-reported to the hypervisor in the next reporting cycle, wasting
> work.
>
> Propagate the PageReported flag to sub-pages during expand() so
> that they are recognized as already-reported.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
> mm/page_alloc.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2d4b6f1a554e..edbb1edf463d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1730,7 +1730,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> * -- nyc
> */
> static inline unsigned int expand(struct zone *zone, struct page *page, int low,
> - int high, int migratetype)
> + int high, int migratetype, bool reported)
> {
> unsigned int size = 1 << high;
> unsigned int nr_added = 0;
> @@ -1752,6 +1752,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
> __add_to_free_list(&page[size], zone, high, migratetype, false);
> set_buddy_order(&page[size], high);
> nr_added += size;
> +
> + /*
> + * The parent page has been reported to the host. The
> + * sub-pages are part of the same reported block, so mark
> + * them reported too. This avoids re-reporting pages that
> + * the host already knows about.
> + */
The comment is a bit excessive. I'd say you can drop it completely.
> + if (reported)
> + __SetPageReported(&page[size]);
> }
>
> return nr_added;
> @@ -1762,9 +1771,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
> int high, int migratetype)
> {
> int nr_pages = 1 << high;
> + bool was_reported = page_reported(page);
>
> __del_page_from_free_list(page, zone, high, migratetype);
> - nr_pages -= expand(zone, page, low, high, migratetype);
> + nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
> account_freepages(zone, -nr_pages, migratetype);
> }
>
> @@ -2322,7 +2332,8 @@ try_to_claim_block(struct zone *zone, struct page *page,
>
> del_page_from_free_list(page, zone, current_order, block_type);
> change_pageblock_range(page, current_order, start_type);
> - nr_added = expand(zone, page, order, current_order, start_type);
> + nr_added = expand(zone, page, order, current_order, start_type,
> + false);
In MM land we started doing
/* reported= */false
This raises a good question: how does buddy merging handle the reported
flag?
--
Cheers,
David
* Re: [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits
2026-04-13 19:11 ` David Hildenbrand (Arm)
@ 2026-04-13 20:32 ` Michael S. Tsirkin
0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 20:32 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 09:11:51PM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 00:50, Michael S. Tsirkin wrote:
> > When a reported free page is split via expand() to satisfy a
> > smaller allocation, the sub-pages placed back on the free lists
> > lose the PageReported flag. This means they will be unnecessarily
> > re-reported to the hypervisor in the next reporting cycle, wasting
> > work.
> >
> > Propagate the PageReported flag to sub-pages during expand() so
> > that they are recognized as already-reported.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> > mm/page_alloc.c | 17 ++++++++++++++---
> > 1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2d4b6f1a554e..edbb1edf463d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1730,7 +1730,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> > * -- nyc
> > */
> > static inline unsigned int expand(struct zone *zone, struct page *page, int low,
> > - int high, int migratetype)
> > + int high, int migratetype, bool reported)
> > {
> > unsigned int size = 1 << high;
> > unsigned int nr_added = 0;
> > @@ -1752,6 +1752,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
> > __add_to_free_list(&page[size], zone, high, migratetype, false);
> > set_buddy_order(&page[size], high);
> > nr_added += size;
> > +
> > + /*
> > + * The parent page has been reported to the host. The
> > + * sub-pages are part of the same reported block, so mark
> > + * them reported too. This avoids re-reporting pages that
> > + * the host already knows about.
> > + */
>
> The comment is a bit excessive. I'd say you can drop it completely.
>
> > + if (reported)
> > + __SetPageReported(&page[size]);
> > }
> >
> > return nr_added;
> > @@ -1762,9 +1771,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
> > int high, int migratetype)
> > {
> > int nr_pages = 1 << high;
> > + bool was_reported = page_reported(page);
> >
> > __del_page_from_free_list(page, zone, high, migratetype);
> > - nr_pages -= expand(zone, page, low, high, migratetype);
> > + nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
> > account_freepages(zone, -nr_pages, migratetype);
> > }
> >
> > @@ -2322,7 +2332,8 @@ try_to_claim_block(struct zone *zone, struct page *page,
> >
> > del_page_from_free_list(page, zone, current_order, block_type);
> > change_pageblock_range(page, current_order, start_type);
> > - nr_added = expand(zone, page, order, current_order, start_type);
> > + nr_added = expand(zone, page, order, current_order, start_type,
> > + false);
>
> In MM land we started doing
>
> /* reported= */false
>
>
> This raises a good question: how does buddy merging handle the reported
> flag?
>
> --
> Cheers,
>
> David
IIUC it doesn't: reported pages are never merged; if a page is being
merged, it has just entered the buddy.
--
MST
* Re: [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-13 8:00 ` David Hildenbrand (Arm)
2026-04-13 8:10 ` Michael S. Tsirkin
@ 2026-04-13 20:35 ` Michael S. Tsirkin
1 sibling, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 20:35 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 10:00:58AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 00:50, Michael S. Tsirkin wrote:
> > When a guest reports free pages to the hypervisor via the page reporting
> > framework (used by virtio-balloon and hv_balloon), the host typically
> > zeros those pages when reclaiming their backing memory. However, when
> > those pages are later allocated in the guest, post_alloc_hook()
> > unconditionally zeros them again if __GFP_ZERO is set. This
> > double-zeroing is wasteful, especially for large pages.
> >
> > Avoid redundant zeroing by propagating the "host already zeroed this"
> > information through the allocation path:
> >
> > 1. Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
> > drivers to declare that their host zeros reported pages on reclaim.
> > A static key (page_reporting_host_zeroes) gates the fast path.
> >
> > 2. In page_del_and_expand(), when the page was reported and the
> > static key is enabled, stash a sentinel value (MAGIC_PAGE_ZEROED)
> > in page->private.
> >
> > 3. In post_alloc_hook(), check page->private for the sentinel. If
> > present and zeroing was requested (but not tag zeroing), skip
> > kernel_init_pages().
> >
> > In particular, __GFP_ZERO is used by the x86 arch override of
> > vma_alloc_zeroed_movable_folio.
> >
> > No driver sets host_zeroes_pages yet; a follow-up patch to
> > virtio_balloon is needed to opt in.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> > include/linux/mm.h | 6 ++++++
> > include/linux/page_reporting.h | 3 +++
> > mm/page_alloc.c | 21 +++++++++++++++++++++
> > mm/page_reporting.c | 9 +++++++++
> > mm/page_reporting.h | 2 ++
> > 5 files changed, 41 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 5be3d8a8f806..59fc77c4c90e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4814,6 +4814,12 @@ static inline bool user_alloc_needs_zeroing(void)
> > &init_on_alloc);
> > }
> >
> > +/*
> > + * Sentinel stored in page->private to indicate the page was pre-zeroed
> > + * by the hypervisor (via free page reporting).
> > + */
> > +#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
>
> Why are we not using another page flag that is yet unused for buddy pages?
>
> Using page->private for that, and exposing it to buddy users with the
> __GFP_PREZEROED flag (I hope we can avoid that) does not sound
> particularly elegant.
So here's the only alternative I see: a page flag for when the page is in
the buddy, plus a new "prezero" bool that we have to propagate everywhere
else. This is a patch on top. More elegant? Please tell me if you prefer
that; if so, I will squash it into the appropriate patches.
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 903f87c7fec9..b9c5bdbb0e7b 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -294,7 +294,7 @@ enum {
#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
-/* Caller handles pre-zeroed pages; preserve MAGIC_PAGE_ZEROED in private */
+/* Caller handles pre-zeroed pages; preserve PagePrezeroed */
#define __GFP_PREZEROED ((__force gfp_t)___GFP_PREZEROED)
/* Disable lockdep for GFP context tracking */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index caa1de31bbca..3e46233d5758 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4814,11 +4814,21 @@ static inline bool user_alloc_needs_zeroing(void)
&init_on_alloc);
}
-/*
- * Sentinel stored in page->private to indicate the page was pre-zeroed
- * by the hypervisor (via free page reporting).
+/**
+ * __page_test_clear_prezeroed - test and clear the pre-zeroed marker.
+ * @page: the page to test.
+ *
+ * Returns true if the page was pre-zeroed by the host, and clears
+ * the marker. Caller must have exclusive access to @page.
*/
-#define MAGIC_PAGE_ZEROED 0x5A45524FU /* ZERO */
+static inline bool __page_test_clear_prezeroed(struct page *page)
+{
+ if (PagePrezeroed(page)) {
+ __ClearPagePrezeroed(page);
+ return true;
+ }
+ return false;
+}
/**
* folio_test_clear_prezeroed - test and clear the pre-zeroed marker.
@@ -4829,11 +4839,7 @@ static inline bool user_alloc_needs_zeroing(void)
*/
static inline bool folio_test_clear_prezeroed(struct folio *folio)
{
- if (page_private(&folio->page) == MAGIC_PAGE_ZEROED) {
- set_page_private(&folio->page, 0);
- return true;
- }
- return false;
+ return __page_test_clear_prezeroed(&folio->page);
}
int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..342f9baf2206 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,8 @@ enum pageflags {
PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
/* Some filesystems */
PG_checked = PG_owner_priv_1,
+ /* Page contents are known to be zero */
+ PG_prezeroed = PG_owner_priv_1,
/*
* Depending on the way an anonymous folio can be mapped into a page
@@ -679,6 +681,13 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
FOLIO_FLAG_FALSE(idle)
#endif
+/*
+ * PagePrezeroed() tracks pages known to be zero. The
+ * allocator may preserve this bit for __GFP_PREZEROED callers so they can
+ * skip redundant zeroing after allocation.
+ */
+__PAGEFLAG(Prezeroed, prezeroed, PF_NO_COMPOUND)
+
/*
* PageReported() is used to track reported free pages within the Buddy
* allocator. We can use the non-atomic version of the test and set
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..d3c024c5a88b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
{
- post_alloc_hook(page, order, __GFP_MOVABLE);
+ post_alloc_hook(page, order, __GFP_MOVABLE, false);
set_page_refcounted(page);
return page;
}
@@ -1833,7 +1833,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
}
dst = (struct folio *)freepage;
- post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
+ post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false);
set_page_refcounted(&dst->page);
if (order)
prep_compound_page(&dst->page, order);
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..ceb0b604c682 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -887,7 +887,8 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
set_page_private(p, 0);
}
-void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
+void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
+ bool prezeroed);
extern bool free_pages_prepare(struct page *page, unsigned int order);
extern int user_min_free_kbytes;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fba8321c45ed..57dc5195b29b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1528,6 +1528,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
+ if (PagePrezeroed(page))
+ __ClearPagePrezeroed(page);
__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
@@ -1783,11 +1785,14 @@ static __always_inline void page_del_and_expand(struct zone *zone,
/*
* If the page was reported and the host is known to zero reported
- * pages, mark it zeroed via page->private so that
- * post_alloc_hook() can skip redundant zeroing.
+ * pages, mark it pre-zeroed so post_alloc_hook() can skip
+ * redundant zeroing.
*/
- if (was_reported)
- set_page_private(page, MAGIC_PAGE_ZEROED);
+ if (was_reported) {
+ __SetPagePrezeroed(page);
+ } else {
+ __ClearPagePrezeroed(page);
+ }
}
static void check_new_page_bad(struct page *page)
@@ -1859,21 +1864,20 @@ static inline bool should_skip_init(gfp_t flags)
}
inline void post_alloc_hook(struct page *page, unsigned int order,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, bool prezeroed)
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
- bool prezeroed = page_private(page) == MAGIC_PAGE_ZEROED;
+ bool preserve_prezeroed = prezeroed && (gfp_flags & __GFP_PREZEROED);
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
int i;
/*
* If the page is pre-zeroed and the caller opted in via
* __GFP_PREZEROED, preserve the marker so the caller can
- * skip its own zeroing. Otherwise always clear private.
+ * skip its own zeroing.
*/
- if (!(prezeroed && (gfp_flags & __GFP_PREZEROED)))
- set_page_private(page, 0);
+ __ClearPagePrezeroed(page);
/*
* If the page is pre-zeroed, skip memory initialization.
@@ -1923,15 +1927,18 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
if (init)
kernel_init_pages(page, 1 << order);
+ if (preserve_prezeroed)
+ __SetPagePrezeroed(page);
+
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
pgalloc_tag_add(page, current, 1 << order);
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
- unsigned int alloc_flags)
+ unsigned int alloc_flags, bool prezeroed)
{
- post_alloc_hook(page, order, gfp_flags);
+ post_alloc_hook(page, order, gfp_flags, prezeroed);
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
@@ -3276,7 +3283,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
static __always_inline
struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
unsigned int order, unsigned int alloc_flags,
- int migratetype)
+ int migratetype, bool *prezeroed)
{
struct page *page;
unsigned long flags;
@@ -3311,6 +3318,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ *prezeroed = __page_test_clear_prezeroed(page);
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3372,10 +3380,9 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
- int migratetype,
- unsigned int alloc_flags,
+ int migratetype, unsigned int alloc_flags,
struct per_cpu_pages *pcp,
- struct list_head *list)
+ struct list_head *list, bool *prezeroed)
{
struct page *page;
@@ -3396,6 +3403,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
page = list_first_entry(list, struct page, pcp_list);
list_del(&page->pcp_list);
pcp->count -= 1 << order;
+ *prezeroed = __page_test_clear_prezeroed(page);
} while (check_new_pages(page, order));
return page;
@@ -3404,7 +3412,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
/* Lock and remove page from the per-cpu list */
static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
- int migratetype, unsigned int alloc_flags)
+ int migratetype, unsigned int alloc_flags,
+ bool *prezeroed)
{
struct per_cpu_pages *pcp;
struct list_head *list;
@@ -3423,7 +3432,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
*/
pcp->free_count >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
- page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+ page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
+ pcp, list, prezeroed);
pcp_spin_unlock(pcp, UP_flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3448,19 +3458,19 @@ static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
- int migratetype)
+ int migratetype, bool *prezeroed)
{
struct page *page;
if (likely(pcp_allowed_order(order))) {
page = rmqueue_pcplist(preferred_zone, zone, order,
- migratetype, alloc_flags);
+ migratetype, alloc_flags, prezeroed);
if (likely(page))
goto out;
}
page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
- migratetype);
+ migratetype, prezeroed);
out:
/* Separate test+clear to avoid unnecessary atomics */
@@ -3851,6 +3861,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct pglist_data *last_pgdat = NULL;
bool last_pgdat_dirty_ok = false;
bool no_fallback;
+ bool prezeroed;
bool skip_kswapd_nodes = nr_online_nodes > 1;
bool skipped_kswapd_nodes = false;
@@ -3995,9 +4006,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
try_this_zone:
page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
- gfp_mask, alloc_flags, ac->migratetype);
+ gfp_mask, alloc_flags, ac->migratetype,
+ &prezeroed);
if (page) {
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ prep_new_page(page, order, gfp_mask, alloc_flags,
+ prezeroed);
/*
* If this is a high-order atomic allocation then check
@@ -4232,7 +4245,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
/* Prep a captured page if available */
if (page)
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ prep_new_page(page, order, gfp_mask, alloc_flags, false);
/* Try get a page from the freelist if available */
if (!page)
@@ -5206,6 +5219,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
/* Attempt the batch allocation */
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
while (nr_populated < nr_pages) {
+ bool prezeroed = false;
/* Skip existing pages */
if (page_array[nr_populated]) {
@@ -5214,7 +5228,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
- pcp, pcp_list);
+ pcp, pcp_list, &prezeroed);
if (unlikely(!page)) {
/* Try and allocate at least one page */
if (!nr_account) {
@@ -5225,7 +5239,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
nr_account++;
- prep_new_page(page, 0, gfp, 0);
+ prep_new_page(page, 0, gfp, 0, prezeroed);
set_page_refcounted(page);
page_array[nr_populated++] = page;
}
@@ -6948,7 +6962,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
list_for_each_entry_safe(page, next, &list[order], lru) {
int i;
- post_alloc_hook(page, order, gfp_mask);
+ post_alloc_hook(page, order, gfp_mask, false);
if (!order)
continue;
@@ -7154,7 +7168,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
struct page *head = pfn_to_page(start);
check_new_pages(head, order);
- prep_new_page(head, order, gfp_mask, 0);
+ prep_new_page(head, order, gfp_mask, 0, false);
} else {
ret = -EINVAL;
WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
2026-04-13 9:05 ` David Hildenbrand (Arm)
@ 2026-04-13 20:37 ` Michael S. Tsirkin
2026-04-13 21:37 ` Michael S. Tsirkin
2026-04-13 22:06 ` Michael S. Tsirkin
2 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 20:37 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 11:05:40AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 00:50, Michael S. Tsirkin wrote:
> > The previous patch skips zeroing in post_alloc_hook() when
> > __GFP_ZERO is used. However, several page allocation paths
> > zero pages via folio_zero_user() or clear_user_highpage() after
> > allocation, not via __GFP_ZERO.
> >
> > Add __GFP_PREZEROED gfp flag that tells post_alloc_hook() to
> > preserve the MAGIC_PAGE_ZEROED sentinel in page->private so the
> > caller can detect pre-zeroed pages and skip its own zeroing.
> > Add folio_test_clear_prezeroed() helper to check and clear
> > the sentinel.
>
> I really don't like __GFP_PREZEROED, and wonder how we can avoid it.
>
>
> What you want is, allocate a folio (well, actually a page that becomes
> a folio) and know whether zeroing for that folio (once we establish it
> from a page) is still required.
>
> Or you just allocate a folio, specify GFP_ZERO, and let the folio
> allocation code deal with that.
>
>
> I think we have two options:
>
> (1) Use an indication that can be sticky for callers that do not care.
>
> Assuming we would use a page flag that is only ever used on folios, all
> we'd have to do is make sure that we clear the flag once we convert
> the page to a folio.
>
> For example, PG_dropbehind is only ever set on folios in the pagecache.
>
> Paths that allocate folios would have to clear the flag. For non-hugetlb
> folios that happens through page_rmappable_folio().
>
> I'm not super-happy about that, but it would be doable.
>
>
> (2) Use a dedicated allocation interface for user pages in the buddy.
>
> I hate the whole user_alloc_needs_zeroing()+folio_zero_user() handling.
>
> It shouldn't exist. We should just be passing GFP_ZERO and let the buddy handle
> all that.
>
>
> For example, vma_alloc_folio() already gets passed the address in.
>
> Pass the address from vma_alloc_folio_noprof()->folio_alloc_noprof(), and let
> folio_alloc_noprof() use a buddy interface that can handle it.
>
> Imagine if we had a alloc_user_pages_noprof() that consumes an address. It could just
> do what folio_zero_user() does, and only if really required.
>
> The whole user_alloc_needs_zeroing() could go away and you could just handle the
> pre-zeroed optimization internally.
>
> --
> Cheers,
>
> David
I admit I only vaguely understand the core mm refactoring you are suggesting.
--
MST
* Re: [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
2026-04-13 9:05 ` David Hildenbrand (Arm)
2026-04-13 20:37 ` Michael S. Tsirkin
@ 2026-04-13 21:37 ` Michael S. Tsirkin
2026-04-13 22:06 ` Michael S. Tsirkin
2 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 21:37 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 11:05:40AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 00:50, Michael S. Tsirkin wrote:
> > The previous patch skips zeroing in post_alloc_hook() when
> > __GFP_ZERO is used. However, several page allocation paths
> > zero pages via folio_zero_user() or clear_user_highpage() after
> > allocation, not via __GFP_ZERO.
> >
> > Add __GFP_PREZEROED gfp flag that tells post_alloc_hook() to
> > preserve the MAGIC_PAGE_ZEROED sentinel in page->private so the
> > caller can detect pre-zeroed pages and skip its own zeroing.
> > Add folio_test_clear_prezeroed() helper to check and clear
> > the sentinel.
>
> I really don't like __GFP_PREZEROED, and wonder how we can avoid it.
>
>
> What you want is, allocate a folio (well, actually a page that becomes
> a folio) and know whether zeroing for that folio (once we establish it
> from a page) is still required.
>
> Or you just allocate a folio, specify GFP_ZERO, and let the folio
> allocation code deal with that.
>
>
> I think we have two options:
>
> (1) Use an indication that can be sticky for callers that do not care.
>
> Assuming we would use a page flag that is only ever used on folios, all
> we'd have to do is make sure that we clear the flag once we convert
> the page to a folio.
>
> For example, PG_dropbehind is only ever set on folios in the pagecache.
>
> Paths that allocate folios would have to clear the flag. For non-hugetlb
> folios that happens through page_rmappable_folio().
>
> I'm not super-happy about that, but it would be doable.
I suspect PG_dropbehind (or any flag, e.g. PG_owner_priv_1, which the
patch I sent uses) won't work as-is for this.
The issue is PAGE_FLAGS_CHECK_AT_PREP:
#define PAGE_FLAGS_CHECK_AT_PREP \
((PAGEFLAGS_MASK & ~__PG_HWPOISON) | ...)
This includes all page flags except hwpoison. check_new_pages()
verifies that none of these flags are set on an allocated page.
PG_dropbehind is part of PAGEFLAGS_MASK, so if we set it in
page_del_and_expand() to mark a page as pre-zeroed, check_new_pages()
would reject it as a bad page.
I guess we could exclude it unconditionally, but this
looks like a riskier change to me. No?
>
> (2) Use a dedicated allocation interface for user pages in the buddy.
>
> I hate the whole user_alloc_needs_zeroing()+folio_zero_user() handling.
>
> It shouldn't exist. We should just be passing GFP_ZERO and let the buddy handle
> all that.
>
>
> For example, vma_alloc_folio() already gets passed the address in.
>
> Pass the address from vma_alloc_folio_noprof()->folio_alloc_noprof(), and let
> folio_alloc_noprof() use a buddy interface that can handle it.
>
> Imagine if we had a alloc_user_pages_noprof() that consumes an address. It could just
> do what folio_zero_user() does, and only if really required.
>
> The whole user_alloc_needs_zeroing() could go away and you could just handle the
> pre-zeroed optimization internally.
It's all rather messy; from what I've seen so far, there are arch-specific
hacks around this.
> --
> Cheers,
>
> David
* Re: [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
2026-04-13 9:05 ` David Hildenbrand (Arm)
2026-04-13 20:37 ` Michael S. Tsirkin
2026-04-13 21:37 ` Michael S. Tsirkin
@ 2026-04-13 22:06 ` Michael S. Tsirkin
2026-04-13 23:43 ` Michael S. Tsirkin
2 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 22:06 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 11:05:40AM +0200, David Hildenbrand (Arm) wrote:
...
> (2) Use a dedicated allocation interface for user pages in the buddy.
>
> I hate the whole user_alloc_needs_zeroing()+folio_zero_user() handling.
>
> It shouldn't exist. We should just be passing GFP_ZERO and let the buddy handle
> all that.
>
>
> For example, vma_alloc_folio() already gets passed the address in.
>
> Pass the address from vma_alloc_folio_noprof()->folio_alloc_noprof(), and let
> folio_alloc_noprof() use a buddy interface that can handle it.
>
> Imagine if we had a alloc_user_pages_noprof() that consumes an address. It could just
> do what folio_zero_user() does, and only if really required.
>
> The whole user_alloc_needs_zeroing() could go away and you could just handle the
> pre-zeroed optimization internally.
I looked at this a bit, and I think the issue is that the buddy
allocator doesn't do the arch-specific cache handling.
So zeroing a user page is a fundamentally different thing from GFP_ZERO,
which means "zero a kernel address range".
So I don't get how you want to do it.
>
> --
> Cheers,
>
> David
* Re: [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed()
2026-04-13 22:06 ` Michael S. Tsirkin
@ 2026-04-13 23:43 ` Michael S. Tsirkin
0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2026-04-13 23:43 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Brendan Jackman,
Michal Hocko, Suren Baghdasaryan, Jason Wang, Andrea Arcangeli,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Zi Yan
On Mon, Apr 13, 2026 at 06:06:14PM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 13, 2026 at 11:05:40AM +0200, David Hildenbrand (Arm) wrote:
>
> ...
>
> > (2) Use a dedicated allocation interface for user pages in the buddy.
> >
> > I hate the whole user_alloc_needs_zeroing()+folio_zero_user() handling.
> >
> > It shouldn't exist. We should just be passing GFP_ZERO and let the buddy handle
> > all that.
> >
> >
> > For example, vma_alloc_folio() already gets passed the address in.
> >
> > Pass the address from vma_alloc_folio_noprof()->folio_alloc_noprof(), and let
> > folio_alloc_noprof() use a buddy interface that can handle it.
> >
> > Imagine if we had a alloc_user_pages_noprof() that consumes an address. It could just
> > do what folio_zero_user() does, and only if really required.
> >
> > The whole user_alloc_needs_zeroing() could go away and you could just handle the
> > pre-zeroed optimization internally.
>
> I looked at this a bit, and I think the issue is that the buddy
> allocator doesn't do the arch-specific cache handling.
> So allocating it is a fundamentally different thing from GFP_ZERO which
> is "zero a kernel address range".
>
> So I don't get how you want to do it.
Oh, wait, do you mean we thread the userspace address through all the allocation calls?
Like the below? This is on top of my patches; on top of mm it will be
a tiny bit smaller. I can rebase, no problem.
But isn't it a bit overkill for something that is, in the end,
virtualization-specific? It's all mechanical threading of user_addr through
the call chain, but should we miss one place, we'd suddenly get weird
corruption on esoteric arches, since we'd do a plain memset instead of
folio_zero_user.
Worth it?
Let me know.
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 301567ad160f..3b444a7b11cd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -828,8 +828,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(folio);
goto out;
}
- if (!folio_test_clear_prezeroed(folio))
- folio_zero_user(folio, addr);
__folio_mark_uptodate(folio);
error = hugetlb_add_to_page_cache(folio, mapping, index);
if (unlikely(error)) {
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51ef13ed756e..06b71beed0a7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -227,11 +227,11 @@ static inline void arch_alloc_page(struct page *page, int order) { }
#endif
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
- nodemask_t *nodemask);
+ nodemask_t *nodemask, unsigned long user_addr);
#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
- nodemask_t *nodemask);
+ nodemask_t *nodemask, unsigned long user_addr);
#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__))
unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
@@ -286,7 +286,7 @@ __alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask);
- return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
+ return __alloc_pages_noprof(gfp_mask, order, nid, NULL, 0);
}
#define __alloc_pages_node(...) alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__))
@@ -297,7 +297,7 @@ struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid)
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp);
- return __folio_alloc_noprof(gfp, order, nid, NULL);
+ return __folio_alloc_noprof(gfp, order, nid, NULL, 0);
}
#define __folio_alloc_node(...) alloc_hooks(__folio_alloc_node_noprof(__VA_ARGS__))
@@ -322,7 +322,8 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
- struct mempolicy *mpol, pgoff_t ilx, int nid);
+ struct mempolicy *mpol, pgoff_t ilx, int nid,
+ unsigned long user_addr);
struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr);
#else
@@ -335,14 +336,17 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
return __folio_alloc_node_noprof(gfp, order, numa_node_id());
}
static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
- struct mempolicy *mpol, pgoff_t ilx, int nid)
+ struct mempolicy *mpol, pgoff_t ilx, int nid,
+ unsigned long user_addr)
{
- return folio_alloc_noprof(gfp, order);
+ return __folio_alloc_noprof(gfp | __GFP_COMP, order, numa_node_id(),
+ NULL, user_addr);
}
static inline struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
struct vm_area_struct *vma, unsigned long addr)
{
- return folio_alloc_noprof(gfp, order);
+ return folio_alloc_mpol_noprof(gfp, order, NULL, 0, numa_node_id(),
+ addr);
}
#endif
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index b9c5bdbb0e7b..6c75df30a281 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -56,7 +56,6 @@ enum {
___GFP_NOLOCKDEP_BIT,
#endif
___GFP_NO_OBJ_EXT_BIT,
- ___GFP_PREZEROED_BIT,
___GFP_LAST_BIT
};
@@ -98,7 +97,6 @@ enum {
#define ___GFP_NOLOCKDEP 0
#endif
#define ___GFP_NO_OBJ_EXT BIT(___GFP_NO_OBJ_EXT_BIT)
-#define ___GFP_PREZEROED BIT(___GFP_PREZEROED_BIT)
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -294,9 +292,6 @@ enum {
#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
-/* Caller handles pre-zeroed pages; preserve PagePrezeroed */
-#define __GFP_PREZEROED ((__force gfp_t)___GFP_PREZEROED)
-
/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index b649e7e315f4..ffa683f64f1d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -320,15 +320,8 @@ static inline
struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
unsigned long vaddr)
{
- struct folio *folio;
-
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_PREZEROED,
- 0, vma, vaddr);
- if (folio && user_alloc_needs_zeroing() &&
- !folio_test_clear_prezeroed(folio))
- clear_user_highpage(&folio->page, vaddr);
-
- return folio;
+ return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+ 0, vma, vaddr);
}
#endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 07e3ef8c0418..e05d14536329 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -937,7 +937,7 @@ static inline bool hugepage_movable_supported(struct hstate *h)
/* Movability of hugepages depends on migration support. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
- gfp_t gfp = __GFP_COMP | __GFP_NOWARN | __GFP_PREZEROED;
+ gfp_t gfp = __GFP_COMP | __GFP_NOWARN | __GFP_ZERO;
gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 342f9baf2206..851ebf1c9902 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -682,9 +682,8 @@ FOLIO_FLAG_FALSE(idle)
#endif
/*
- * PagePrezeroed() tracks pages known to be zero. The
- * allocator may preserve this bit for __GFP_PREZEROED callers so they can
- * skip redundant zeroing after allocation.
+ * PagePrezeroed() tracks pages known to be zero. The allocator
+ * uses this to skip redundant zeroing in post_alloc_hook().
*/
__PAGEFLAG(Prezeroed, prezeroed, PF_NO_COMPOUND)
diff --git a/mm/compaction.c b/mm/compaction.c
index d3c024c5a88b..9c61fa61941b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
{
- post_alloc_hook(page, order, __GFP_MOVABLE, false);
+ post_alloc_hook(page, order, __GFP_MOVABLE, false, 0);
set_page_refcounted(page);
return page;
}
@@ -1833,7 +1833,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
}
dst = (struct folio *)freepage;
- post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false);
+ post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false, 0);
set_page_refcounted(&dst->page);
if (order)
prep_compound_page(&dst->page, order);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4ada..867bfe0c1d37 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -998,7 +998,7 @@ struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order,
if (policy)
return folio_alloc_mpol_noprof(gfp, order, policy,
- NO_INTERLEAVE_INDEX, numa_node_id());
+ NO_INTERLEAVE_INDEX, numa_node_id(), 0);
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3b9b53fad0f1..ffad77f95ec2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1256,7 +1256,7 @@ EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
unsigned long addr)
{
- gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_PREZEROED;
+ gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
const int order = HPAGE_PMD_ORDER;
struct folio *folio;
@@ -1279,14 +1279,6 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
}
folio_throttle_swaprate(folio, gfp);
- /*
- * When a folio is not zeroed during allocation (__GFP_ZERO not used)
- * or user folios require special handling, folio_zero_user() is used to
- * make sure that the page corresponding to the faulting address will be
- * hot in the cache after zeroing.
- */
- if (user_alloc_needs_zeroing() && !folio_test_clear_prezeroed(folio))
- folio_zero_user(folio, addr);
/*
* The memory barrier inside __folio_mark_uptodate makes sure that
* folio_zero_user writes become visible before the set_pmd_at()
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b23b006c37c..8bd450fac6cb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1842,7 +1842,8 @@ struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio)
}
static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
- int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry)
+ int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry,
+ unsigned long addr)
{
struct folio *folio;
bool alloc_try_hard = true;
@@ -1859,7 +1860,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
if (alloc_try_hard)
gfp_mask |= __GFP_RETRY_MAYFAIL;
- folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask);
+ folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask, addr);
/*
* If we did not specify __GFP_RETRY_MAYFAIL, but still got a
@@ -1888,7 +1889,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
+ nodemask_t *node_alloc_noretry, unsigned long addr)
{
struct folio *folio;
int order = huge_page_order(h);
@@ -1900,7 +1901,7 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask);
else
folio = alloc_buddy_frozen_folio(order, gfp_mask, nid, nmask,
- node_alloc_noretry);
+ node_alloc_noretry, addr);
if (folio)
init_new_hugetlb_folio(folio);
return folio;
@@ -1914,11 +1915,12 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
* pages is zero, and the accounting must be done in the caller.
*/
static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask)
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ unsigned long addr)
{
struct folio *folio;
- folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
+ folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL, addr);
if (folio)
hugetlb_vmemmap_optimize_folio(h, folio);
return folio;
@@ -1958,7 +1960,7 @@ static struct folio *alloc_pool_huge_folio(struct hstate *h,
struct folio *folio;
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
- nodes_allowed, node_alloc_noretry);
+ nodes_allowed, node_alloc_noretry, 0);
if (folio)
return folio;
}
@@ -2127,7 +2129,8 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn, unsigned long end_pfn)
* Allocates a fresh surplus page from the page allocator.
*/
static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask)
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ unsigned long addr)
{
struct folio *folio = NULL;
@@ -2139,7 +2142,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
goto out_unlock;
spin_unlock_irq(&hugetlb_lock);
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
+ folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, addr);
if (!folio)
return NULL;
@@ -2182,7 +2185,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
if (hstate_is_gigantic(h))
return NULL;
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
+ folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, 0);
if (!folio)
return NULL;
@@ -2218,14 +2221,14 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
if (mpol_is_preferred_many(mpol)) {
gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
+ folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask, addr);
/* Fallback to all nodes if page==NULL */
nodemask = NULL;
}
if (!folio)
- folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
+ folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask, addr);
mpol_cond_put(mpol);
return folio;
}
@@ -2332,7 +2335,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
* down the road to pick the current node if that is the case.
*/
folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
- NUMA_NO_NODE, &alloc_nodemask);
+ NUMA_NO_NODE, &alloc_nodemask, 0);
if (!folio) {
alloc_ok = false;
break;
@@ -2738,7 +2741,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
spin_unlock_irq(&hugetlb_lock);
gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
new_folio = alloc_fresh_hugetlb_folio(h, gfp_mask,
- nid, NULL);
+ nid, NULL, 0);
if (!new_folio)
return -ENOMEM;
goto retry;
@@ -3434,13 +3437,13 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
- &node_states[N_MEMORY], NULL);
+ &node_states[N_MEMORY], NULL, 0);
if (!folio && !list_empty(&folio_list) &&
hugetlb_vmemmap_optimizable_size(h)) {
prep_and_add_allocated_folios(h, &folio_list);
INIT_LIST_HEAD(&folio_list);
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
- &node_states[N_MEMORY], NULL);
+ &node_states[N_MEMORY], NULL, 0);
}
if (!folio)
break;
@@ -5809,8 +5812,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
ret = 0;
goto out;
}
- if (!folio_test_clear_prezeroed(folio))
- folio_zero_user(folio, vmf->real_address);
__folio_mark_uptodate(folio);
new_folio = true;
diff --git a/mm/internal.h b/mm/internal.h
index ceb0b604c682..b5df7c673ce7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -672,6 +672,7 @@ struct alloc_context {
*/
enum zone_type highest_zoneidx;
bool spread_dirty_pages;
+ unsigned long user_addr;
};
/*
@@ -888,13 +889,13 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
}
void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
- bool prezeroed);
+ bool prezeroed, unsigned long user_addr);
extern bool free_pages_prepare(struct page *page, unsigned int order);
extern int user_min_free_kbytes;
struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
- nodemask_t *);
+ nodemask_t *, unsigned long user_addr);
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
void free_frozen_pages(struct page *page, unsigned int order);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1dd3cfca610d..3ae80c25a51e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1055,7 +1055,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
int node = hpage_collapse_find_target_node(cc);
struct folio *folio;
- folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+ folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask, 0);
if (!folio) {
*foliop = NULL;
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
diff --git a/mm/memory.c b/mm/memory.c
index 2f61321a81fd..beb6ce312dec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5176,7 +5176,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
goto fallback;
/* Try allocating the highest of the remaining orders. */
- gfp = vma_thp_gfp_mask(vma) | __GFP_PREZEROED;
+ gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
folio = vma_alloc_folio(gfp, order, vma, addr);
@@ -5187,16 +5187,6 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
goto next;
}
folio_throttle_swaprate(folio, gfp);
- /*
- * When a folio is not zeroed during allocation
- * (__GFP_ZERO not used) or user folios require special
- * handling, folio_zero_user() is used to make sure
- * that the page corresponding to the faulting address
- * will be hot in the cache after zeroing.
- */
- if (user_alloc_needs_zeroing() &&
- !folio_test_clear_prezeroed(folio))
- folio_zero_user(folio, vmf->address);
return folio;
}
next:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e5175f1c767..d5fe2da537c9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1454,7 +1454,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
else
gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
- return folio_alloc_mpol(gfp, order, pol, ilx, nid);
+ return folio_alloc_mpol(gfp, order, pol, ilx, nid, 0);
}
#else
@@ -2419,9 +2419,9 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
+ page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask, 0);
if (!page)
- page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
+ page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL, 0);
return page;
}
@@ -2437,7 +2437,8 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
* Return: The page on success or NULL if allocation fails.
*/
static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
- struct mempolicy *pol, pgoff_t ilx, int nid)
+ struct mempolicy *pol, pgoff_t ilx, int nid,
+ unsigned long user_addr)
{
nodemask_t *nodemask;
struct page *page;
@@ -2469,7 +2470,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
*/
page = __alloc_frozen_pages_noprof(
gfp | __GFP_THISNODE | __GFP_NORETRY, order,
- nid, NULL);
+ nid, NULL, user_addr);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
/*
@@ -2481,7 +2482,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}
- page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
+ page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask, user_addr);
if (unlikely(pol->mode == MPOL_INTERLEAVE ||
pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
@@ -2498,10 +2499,11 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
- struct mempolicy *pol, pgoff_t ilx, int nid)
+ struct mempolicy *pol, pgoff_t ilx, int nid,
+ unsigned long user_addr)
{
struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
- ilx, nid);
+ ilx, nid, user_addr);
if (!page)
return NULL;
@@ -2535,7 +2537,7 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct
gfp |= __GFP_NOWARN;
pol = get_vma_policy(vma, addr, order, &ilx);
- folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
+ folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id(), addr);
mpol_cond_put(pol);
return folio;
}
@@ -2553,7 +2555,7 @@ struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
pol = get_task_policy(current);
return alloc_pages_mpol(gfp, order, pol, NO_INTERLEAVE_INDEX,
- numa_node_id());
+ numa_node_id(), 0);
}
/**
diff --git a/mm/migrate.c b/mm/migrate.c
index 1bf2cf8c44dd..e899b0dc2461 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2202,7 +2202,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
gfp_mask |= __GFP_HIGHMEM;
- return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
+ return __folio_alloc(gfp_mask, order, nid, mtc->nmask, 0);
}
#ifdef CONFIG_NUMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 57dc5195b29b..65f4f9ebd4a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1864,19 +1864,14 @@ static inline bool should_skip_init(gfp_t flags)
}
inline void post_alloc_hook(struct page *page, unsigned int order,
- gfp_t gfp_flags, bool prezeroed)
+ gfp_t gfp_flags, bool prezeroed,
+ unsigned long user_addr)
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
- bool preserve_prezeroed = prezeroed && (gfp_flags & __GFP_PREZEROED);
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
int i;
- /*
- * If the page is pre-zeroed and the caller opted in via
- * __GFP_PREZEROED, preserve the marker so the caller can
- * skip its own zeroing.
- */
__ClearPagePrezeroed(page);
/*
@@ -1923,12 +1918,17 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
for (i = 0; i != 1 << order; ++i)
page_kasan_tag_reset(page + i);
}
- /* If memory is still not initialized, initialize it now. */
- if (init)
- kernel_init_pages(page, 1 << order);
-
- if (preserve_prezeroed)
- __SetPagePrezeroed(page);
+ /*
+ * If memory is still not initialized, initialize it now.
+ * For user pages, use folio_zero_user() which zeros near the
+ * faulting address last, keeping those cachelines hot.
+ */
+ if (init) {
+ if (user_addr)
+ folio_zero_user(page_folio(page), user_addr);
+ else
+ kernel_init_pages(page, 1 << order);
+ }
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
@@ -1936,9 +1936,10 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
- unsigned int alloc_flags, bool prezeroed)
+ unsigned int alloc_flags, bool prezeroed,
+ unsigned long user_addr)
{
- post_alloc_hook(page, order, gfp_flags, prezeroed);
+ post_alloc_hook(page, order, gfp_flags, prezeroed, user_addr);
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
@@ -4010,7 +4011,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
&prezeroed);
if (page) {
prep_new_page(page, order, gfp_mask, alloc_flags,
- prezeroed);
+ prezeroed, ac->user_addr);
/*
* If this is a high-order atomic allocation then check
@@ -4245,7 +4246,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
/* Prep a captured page if available */
if (page)
- prep_new_page(page, order, gfp_mask, alloc_flags, false);
+ prep_new_page(page, order, gfp_mask, alloc_flags, false, 0);
/* Try get a page from the freelist if available */
if (!page)
@@ -5239,7 +5240,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
nr_account++;
- prep_new_page(page, 0, gfp, 0, prezeroed);
+ prep_new_page(page, 0, gfp, 0, prezeroed, 0);
set_page_refcounted(page);
page_array[nr_populated++] = page;
}
@@ -5253,7 +5254,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
return nr_populated;
failed:
- page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
+ page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask, 0);
if (page)
page_array[nr_populated++] = page;
goto out;
@@ -5264,12 +5265,13 @@ EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
- int preferred_nid, nodemask_t *nodemask)
+ int preferred_nid, nodemask_t *nodemask,
+ unsigned long user_addr)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
- struct alloc_context ac = { };
+ struct alloc_context ac = { .user_addr = user_addr };
/*
* There are several places where we assume that the order value is sane
@@ -5329,11 +5331,12 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
- int preferred_nid, nodemask_t *nodemask)
+ int preferred_nid, nodemask_t *nodemask,
+ unsigned long user_addr)
{
struct page *page;
- page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
+ page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask, user_addr);
if (page)
set_page_refcounted(page);
return page;
@@ -5341,10 +5344,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
EXPORT_SYMBOL(__alloc_pages_noprof);
struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, unsigned long user_addr)
{
struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
- preferred_nid, nodemask);
+ preferred_nid, nodemask, user_addr);
return page_rmappable_folio(page);
}
EXPORT_SYMBOL(__folio_alloc_noprof);
@@ -6962,7 +6965,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
list_for_each_entry_safe(page, next, &list[order], lru) {
int i;
- post_alloc_hook(page, order, gfp_mask, false);
+ post_alloc_hook(page, order, gfp_mask, false, 0);
if (!order)
continue;
@@ -7168,7 +7171,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
struct page *head = pfn_to_page(start);
check_new_pages(head, order);
- prep_new_page(head, order, gfp_mask, 0, false);
+ prep_new_page(head, order, gfp_mask, 0, false, 0);
} else {
ret = -EINVAL;
WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..1183e1ad9b4b 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -57,10 +57,10 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
- numa_mem_id(), NULL);
+ numa_mem_id(), NULL, 0);
#endif
if (unlikely(!page)) {
- page = __alloc_pages(gfp, 0, numa_mem_id(), NULL);
+ page = __alloc_pages(gfp, 0, numa_mem_id(), NULL, 0);
order = 0;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index b40f3cd48961..367ded4375e5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1927,7 +1927,7 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
struct folio *folio;
mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
- folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id());
+ folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id(), 0);
mpol_cond_put(mpol);
return folio;
diff --git a/mm/slub.c b/mm/slub.c
index 0c906fefc31b..a514a3324e8a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3266,7 +3266,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
else if (node == NUMA_NO_NODE)
page = alloc_frozen_pages(flags, order);
else
- page = __alloc_frozen_pages(flags, order, node, NULL);
+ page = __alloc_frozen_pages(flags, order, node, NULL, 0);
if (!page)
return NULL;
@@ -5178,7 +5178,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
if (node == NUMA_NO_NODE)
page = alloc_frozen_pages_noprof(flags, order);
else
- page = __alloc_frozen_pages_noprof(flags, order, node, NULL);
+ page = __alloc_frozen_pages_noprof(flags, order, node, NULL, 0);
if (page) {
ptr = page_address(page);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..f7cbe17a881b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -568,7 +568,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
return NULL;
/* Allocate a new folio to be added into the swap cache. */
- folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+ folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id(), 0);
if (!folio)
return NULL;
/* Try add the new folio, returns existing folio or NULL on failure. */
Thread overview: 22+ messages
2026-04-12 22:50 [PATCH RFC 0/9] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 1/9] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
2026-04-13 19:11 ` David Hildenbrand (Arm)
2026-04-13 20:32 ` Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 2/9] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-13 8:00 ` David Hildenbrand (Arm)
2026-04-13 8:10 ` Michael S. Tsirkin
2026-04-13 8:15 ` David Hildenbrand (Arm)
2026-04-13 8:29 ` Michael S. Tsirkin
2026-04-13 20:35 ` Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 3/9] mm: add __GFP_PREZEROED flag and folio_test_clear_prezeroed() Michael S. Tsirkin
2026-04-13 9:05 ` David Hildenbrand (Arm)
2026-04-13 20:37 ` Michael S. Tsirkin
2026-04-13 21:37 ` Michael S. Tsirkin
2026-04-13 22:06 ` Michael S. Tsirkin
2026-04-13 23:43 ` Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 4/9] mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed pages Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 5/9] mm: skip zeroing in alloc_anon_folio " Michael S. Tsirkin
2026-04-12 22:50 ` [PATCH RFC 6/9] mm: skip zeroing in vma_alloc_anon_folio_pmd " Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 7/9] mm: hugetlb: skip zeroing of pre-zeroed hugetlb pages Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 8/9] mm: page_reporting: add flush parameter to trigger immediate reporting Michael S. Tsirkin
2026-04-12 22:51 ` [PATCH RFC 9/9] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin