[PATCH RFC v3 00/19] mm/virtio: skip redundant zeroing of host-zeroed reported pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Michael S. Tsirkin" <mst@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	Brendan Jackman <jackmanb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Jason Wang <jasowang@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Gregory Price <gourry@gourry.net>,
	linux-mm@kvack.org, virtualization@lists.linux.dev
Subject: [PATCH RFC v3 00/19] mm/virtio: skip redundant zeroing of host-zeroed reported pages
Date: Tue, 21 Apr 2026 18:01:07 -0400	[thread overview]
Message-ID: <cover.1776808209.git.mst@redhat.com> (raw)

When a guest reports free pages to the hypervisor via virtio-balloon's
free page reporting, the host typically zeros those pages when reclaiming
their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
When the guest later reallocates those pages, the kernel zeros them
again, redundantly.

Further, on architectures with aliasing caches, upstream with init_on_alloc
double-zeros user pages: once via kernel_init_pages() in
post_alloc_hook, and again via clear_user_highpage() at the
callsite (because user_alloc_needs_zeroing() returns true).
This series eliminates that double-zeroing by moving the zeroing
into the post_alloc_hook + propagating the "host
already zeroed this page" information through the buddy allocator.

For the reporting part, I am working on virtio spec now, so sending this
out for early feedback. In particular:
- is the mm zeroing rework acceptable?
- is sysfs testing hook for flushing acceptable?
- first 10 patches, including the fix for init_on_alloc double zeroing
  are independently mergeable mm rework -
  are they deemed a desirable rework, and should I post them
  separately for inclusion?

Thanks in advance.

Still an RFC as virtio bits need work, but I would very much like
to get a general agreement on mm bits first, so we don't add
a spec for something we can't then use.

-------

Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:

  metric         baseline            optimized           delta
  task-clock     175 +- 10 ms        40 +- 9 ms          -77%
  cache-misses   924K +- 323K        287K +- 93K         -69%
  instructions   15.3M +- 634K       13.5M +- 337K       -12%

With hugetlb surplus pages:

  metric         baseline            optimized           delta
  task-clock     169 +- 9 ms         49 +- 19 ms         -71%
  cache-misses   1.24M +- 222K       316K +- 114K        -75%
  instructions   17.3M +- 1.23M      15.0M +- 604K       -13%

Notes:
- The virtio_balloon module parameter (13/19) is a testing hack.
  A proper virtio feature flag is needed before merging.
- Patch 14/19 adds a sysfs flush trigger for deterministic testing
  (avoids waiting for the 2-second reporting delay).
- When host_zeroes_pages is set, callers skip folio_zero_user() for
  pages known to be zeroed by the host. This is safe on all
  architectures because the hypervisor invalidates guest cache lines
  when reclaiming page backing (MADV_DONTNEED).

Two flags track known-zero pages:
  PG_zeroed (aliased to PG_private) marks buddy allocator pages that
  are known to contain all zeros -- either because the host zeroed
  them during page reporting, or because they were freed via the
  balloon deflate path.  It lives on free-list pages and is consumed
  by post_alloc_hook() on allocation.
  HPG_zeroed (stored in hugetlb folio->private bits) serves the same
  purpose for hugetlb pool pages, which are kept in a pool and may
  be zeroed long after buddy allocation, so PG_zeroed (consumed at
  allocation time) cannot track their state.

- PG_zeroed is aliased to PG_private. It is excluded from
  PAGE_FLAGS_CHECK_AT_PREP because it must survive on free-list pages
  until post_alloc_hook() consumes and clears it. Is this acceptable,
  or should a different bit be used?

- On architectures with aliasing caches, upstream with init_on_alloc
  double-zeros user pages: once via kernel_init_pages() in
  post_alloc_hook, and again via clear_user_highpage() at the
  callsite (because user_alloc_needs_zeroing() returns true).
  Our patches eliminate this by zeroing once via folio_zero_user()
  in post_alloc_hook.  Not yet performance-tested on aliasing
  hardware.

PG_zeroed lifecycle:

  Sets PG_zeroed:
  - page_reporting_drain: on reported pages when host zeroes them
  - __free_pages_ok / __free_frozen_pages: when FPI_ZEROED is set
    (balloon deflate path)
  - buddy merge: on merged page if both buddies were zeroed
  - expand(): propagate to split-off buddy sub-pages

  Clears PG_zeroed:
  - buddy merge: clear both pages before merge, then conditionally
    re-set on merged head if both were zeroed
  - post_alloc_hook: clear on head page after consuming the hint

HPG_zeroed lifecycle (hugetlb pool pages, stored in folio->private):

  Sets HPG_zeroed:
  - alloc_surplus_hugetlb_folio: after buddy allocation with
    __GFP_ZERO, mark pool page as known-zero

  Clears HPG_zeroed:
  - free_huge_folio: page was mapped to userspace, no longer
    known-zero when it returns to the pool
  - alloc_hugetlb_folio / alloc_hugetlb_folio_reserve: clear
    after reporting to caller via bool *zeroed output (consumed)

- The optimization is most effective with THP, where entire 2MB
  pages are allocated directly from reported order-9+ buddy pages.
  Without THP, only ~21% of order-0 allocations come from reported
  pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
  userspace they return to the hugetlb free pool, not the buddy
  allocator, so they are never reported to the host.  Surplus
  hugetlb pages are allocated from buddy and do benefit.

Test program:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23
  #endif
  #ifndef MAP_HUGETLB
  #define MAP_HUGETLB 0x40000
  #endif

  int main(int argc, char **argv)
  {
      unsigned long size;
      int flags = MAP_PRIVATE | MAP_ANONYMOUS;
      void *p;
      int r;

      if (argc < 2) {
          fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
          return 1;
      }
      size = atol(argv[1]) * 1024UL * 1024;
      if (argc >= 3 && strcmp(argv[2], "huge") == 0)
          flags |= MAP_HUGETLB;
      p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      r = madvise(p, size, MADV_POPULATE_WRITE);
      if (r) {
          perror("madvise");
          return 1;
      }
      munmap(p, size);
      return 0;
  }

Test script (bench.sh):

  #!/bin/bash
  # Usage: bench.sh <size_mb> <mode> <iterations> [huge]
  # mode 0 = baseline, mode 1 = skip zeroing
  SZ=${1:-256}; MODE=${2:-0}; ITER=${3:-10}; HUGE=${4:-}
  FLUSH=/sys/module/page_reporting/parameters/flush
  PERF_DATA=/tmp/perf-$MODE.csv
  rmmod virtio_balloon 2>/dev/null
  insmod virtio_balloon.ko host_zeroes_pages=$MODE
  echo 512 > $FLUSH
  [ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
  rm -f $PERF_DATA
  echo "=== sz=${SZ}MB mode=$MODE iter=$ITER $HUGE ==="
  for i in $(seq 1 $ITER); do
      echo 3 > /proc/sys/vm/drop_caches
      echo 512 > $FLUSH
      perf stat -e task-clock,instructions,cache-misses \
          -x, -o $PERF_DATA --append -- ./alloc_once $SZ $HUGE
  done
  [ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
  rmmod virtio_balloon
  awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;ss[e]+=v*v;n[e]++}
  END{for(e in s){a=s[e]/n[e];d=sqrt(ss[e]/n[e]-a*a);printf "  %-16s %10.0f +- %8.0f (n=%d)\n",e,a,d,n[e]}}' $PERF_DATA

Compile and run:
  gcc -static -O2 -o alloc_once alloc_once.c
  bash bench.sh 256 0 10          # baseline (regular pages)
  bash bench.sh 256 1 10          # optimized (regular pages)
  bash bench.sh 256 0 10 huge     # baseline (hugetlb surplus)
  bash bench.sh 256 1 10 huge     # optimized (hugetlb surplus)

Changes since v2 (address review by Gregory Price and David Hildenbrand):
- v2 used pghint_t / vma_alloc_folio_hints API.  v3 switches to
  threading user_addr through the page allocator and using __GFP_ZERO,
  so post_alloc_hook() can use folio_zero_user() for cache-friendly
  zeroing when the user fault address is known.
- Exclude __PG_ZEROED from PAGE_FLAGS_CHECK_AT_PREP macro definition
  instead of runtime masking in __free_one_page.
- Drop redundant page_poisoning_enabled() check from mm core free
  path -- already guarded at feature negotiation time in
  virtio_balloon_validate.  The balloon driver keeps its own
  page_poisoning_enabled_static() check as defense in depth.
- Split free_frozen_pages_zeroed and put_page_zeroed into separate
  patches.  David Hildenbrand indicated he intends to rework balloon
  pages to be frozen (no refcount), at which point put_page_zeroed
  (16/19) can be dropped and the balloon can call
  free_frozen_pages_zeroed directly.
- Use HPG_zeroed flag (in hugetlb folio->private) for hugetlb pool
  pages instead of PG_zeroed, since pool pages are zeroed long after
  buddy allocation and PG_zeroed is consumed at allocation time.
- syzbot CI found a PF_NO_COMPOUND BUG in the v2 pghint_t approach
  where __ClearPageZeroed was called on compound hugetlb pages in
  free_huge_folio.  The v3 HPG_zeroed approach avoids this.
- Remove redundant arch vma_alloc_zeroed_movable_folio overrides
  on x86, s390, m68k, and alpha (10/19). Suggested by David
  Hildenbrand.
- Updated benchmarking script to compute per-run avg +- stddev
  via awk on CSV output.

Changes v1->v2:
- Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
- Added pghint_t type and vma_alloc_folio_hints() API
- Track PG_zeroed across buddy merges and splits
- Added post_alloc_hook integration (single consume/clear point)
- Added hugetlb support (pool pages + memfd)
- Added page_reporting flush parameter for deterministic testing
- Added free_frozen_pages_hint/put_page_hint for balloon deflate path
- Added try_to_claim_block PG_zeroed preservation
- Updated perf numbers with per-iteration flush methodology

Written with assistance from Claude (claude-opus-4-6).
Reviewed by cursor-agent (GPT-5.4-xhigh).
Everything manually read, patchset split and commit logs edited manually.

Michael S. Tsirkin (19):
  mm: thread user_addr through page allocator for cache-friendly zeroing
  mm: add folio_zero_user stub for configs without THP/HUGETLBFS
  mm: page_alloc: move prep_compound_page before post_alloc_hook
  mm: use folio_zero_user for user pages in post_alloc_hook
  mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
  mm: use __GFP_ZERO in alloc_anon_folio
  mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
  mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages
  mm: memfd: skip zeroing for zeroed hugetlb pool pages
  mm: remove arch vma_alloc_zeroed_movable_folio overrides
  mm: page_alloc: propagate PageReported flag across buddy splits
  mm: page_reporting: skip redundant zeroing of host-zeroed reported
    pages
  virtio_balloon: a hack to enable host-zeroed page optimization
  mm: page_reporting: add flush parameter with page budget
  mm: add free_frozen_pages_zeroed
  mm: add put_page_zeroed and folio_put_zeroed
  mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
  mm: page_alloc: preserve PG_zeroed in page_del_and_expand
  virtio_balloon: mark deflated pages as zeroed

 arch/alpha/include/asm/page.h   |   3 -
 arch/m68k/include/asm/page_no.h |   3 -
 arch/s390/include/asm/page.h    |   3 -
 arch/x86/include/asm/page.h     |   3 -
 drivers/virtio/virtio_balloon.c |  12 ++-
 fs/hugetlbfs/inode.c            |  10 ++-
 include/linux/gfp.h             |  26 ++++--
 include/linux/highmem.h         |   9 +-
 include/linux/hugetlb.h         |  14 ++-
 include/linux/mm.h              |  43 +++++++++
 include/linux/page-flags.h      |  12 ++-
 include/linux/page_reporting.h  |   3 +
 mm/compaction.c                 |   5 +-
 mm/filemap.c                    |   3 +-
 mm/huge_memory.c                |  12 +--
 mm/hugetlb.c                    | 101 +++++++++++++++-------
 mm/internal.h                   |   8 +-
 mm/khugepaged.c                 |   2 +-
 mm/memfd.c                      |  17 ++--
 mm/memory.c                     |  15 +---
 mm/mempolicy.c                  |  39 ++++++---
 mm/migrate.c                    |   2 +-
 mm/page_alloc.c                 | 149 ++++++++++++++++++++++++--------
 mm/page_frag_cache.c            |   4 +-
 mm/page_reporting.c             |  56 +++++++++++-
 mm/page_reporting.h             |  12 +++
 mm/shmem.c                      |   2 +-
 mm/slub.c                       |   4 +-
 mm/swap.c                       |  18 +++-
 mm/swap_state.c                 |   2 +-
 30 files changed, 433 insertions(+), 159 deletions(-)

-- 
MST

next             reply	other threads:[~2026-04-21 22:01 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-21 22:01 Michael S. Tsirkin [this message]
2026-04-21 22:01 ` [PATCH RFC v3 01/19] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 02/19] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 03/19] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 04/19] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 05/19] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 06/19] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 07/19] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 08/19] mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 09/19] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 10/19] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 11/19] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 12/19] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 13/19] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 14/19] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
2026-04-21 22:01 ` [PATCH RFC v3 15/19] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
2026-04-21 22:02 ` [PATCH RFC v3 16/19] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
2026-04-21 22:02 ` [PATCH RFC v3 17/19] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
2026-04-21 22:02 ` [PATCH RFC v3 18/19] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
2026-04-21 22:02 ` [PATCH RFC v3 19/19] virtio_balloon: mark deflated pages as zeroed Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1776808209.git.mst@redhat.com \
    --to=mst@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=gourry@gourry.net \
    --cc=jackmanb@google.com \
    --cc=jasowang@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=virtualization@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox