From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>, Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Vlastimil Babka <vbabka@kernel.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Zi Yan <ziy@nvidia.com>, Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Sean Christopherson <seanjc@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" <kas@kernel.org>
Subject: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
Date: Tue, 14 Apr 2026 15:23:34 +0100
Message-ID: <20260414142354.1465950-1-kas@kernel.org>

This series adds userfaultfd support for tracking the working set of
VM guest memory, enabling VMMs to identify cold pages and evict them
to tiered or remote storage.

== Problem ==

VMMs managing guest memory need to:

  1. Track which pages are actively used (working set detection)
  2. Safely evict cold pages to slower storage
  3. Fetch pages back on demand when accessed again

For shmem-backed guest memory, working set tracking partially works
today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
re-access auto-resolves from cache. But safe eviction still requires
synchronous fault interception to prevent data loss races.

For anonymous guest memory (needed for KSM cross-VM deduplication),
there is no mechanism at all: clearing a PTE loses the page.
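
The difference between the two cases can be reproduced with interfaces
that exist today. The sketch below is illustrative only and is not part
of the series. It assumes Linux, uses a MAP_SHARED | MAP_ANONYMOUS
mapping as a stand-in for shmem-backed guest memory (such mappings are
backed by shmem), and byte_after_dontneed() is a hypothetical helper
name.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous page with the given sharing flag (MAP_SHARED or
 * MAP_PRIVATE), fill it with 0xAB, drop the mapping's PTE with
 * MADV_DONTNEED, then read the first byte back.
 *
 * Returns the byte that was read (0..255), or -1 if mmap() or
 * madvise() failed.
 */
static int byte_after_dontneed(int map_flags)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				map_flags | MAP_ANONYMOUS, -1, 0);
	int ret = -1;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, len);

	/* Zap the PTE; what happens to the page depends on the backing. */
	if (madvise(p, len, MADV_DONTNEED) == 0)
		ret = *(volatile unsigned char *)p;	/* re-access faults */

	munmap(p, len);
	return ret;
}
```

With MAP_SHARED the helper is expected to return 0xAB: the page is
still in the page cache and the re-access maps it back in. With
MAP_PRIVATE it is expected to return 0: the anonymous page was freed
and the fault supplies a fresh zero page.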

== Solution ==

The series introduces a unified userfaultfd interface that works
across both anonymous and shmem-backed memory:

UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
balancing) to make pages inaccessible without freeing them.

UFFD_FEATURE_MINOR_ASYNC: auto-resolves minor faults without handler
involvement. The kernel restores PTE permissions immediately and the
faulting thread continues. Works for anonymous, shmem, and hugetlbfs.

UFFDIO_DEACTIVATE: marks pages as deactivated. For anonymous memory,
sets PROT_NONE on PTEs (pages stay resident). For shmem/hugetlbfs,
zaps PTEs (pages stay in page cache).

UFFDIO_SET_MODE: toggles MINOR_ASYNC at runtime, synchronized via
mmap_write_lock. Enables the VMM workflow: async mode for lightweight
detection, sync mode for race-free eviction.

PAGE_IS_UFFD_DEACTIVATED: PAGEMAP_SCAN category flag for efficient
batch detection of cold (still-deactivated) anonymous pages.

== VMM Workflow ==

  UFFDIO_DEACTIVATE(all)             -- async, no vCPU stalls
  sleep(interval)
  PAGEMAP_SCAN                       -- find cold pages
  UFFDIO_SET_MODE(sync)              -- block faults for eviction
  pwrite + MADV_DONTNEED cold pages  -- safe, faults block
  UFFDIO_SET_MODE(async)             -- resume tracking

The same workflow applies to shmem, with a different PAGEMAP_SCAN mask
(!PAGE_IS_PRESENT instead of PAGE_IS_UFFD_DEACTIVATED).
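
The ordering of the steps above can be illustrated with a toy
user-space model. This is only a sketch of the state transitions, not
the kernel interface: it issues no ioctls, and every name in it
(wss_mode, wss_page, wss_region and the wss_*() helpers) is
hypothetical. It models anonymous pages only and treats any access to
an evicted page as one that needs the VMM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, none is kernel UAPI. */
enum wss_mode { WSS_ASYNC, WSS_SYNC };

struct wss_page {
	bool resident;     /* false once the page has been evicted */
	bool deactivated;  /* stands in for a PROT_NONE or zapped PTE */
};

struct wss_region {
	struct wss_page *pages;
	size_t nr;
	enum wss_mode mode;
};

/* Models UFFDIO_DEACTIVATE(all): mark every resident page deactivated. */
static void wss_deactivate_all(struct wss_region *r)
{
	for (size_t i = 0; i < r->nr; i++)
		if (r->pages[i].resident)
			r->pages[i].deactivated = true;
}

/*
 * Models a guest access. Returns true if the access completes without
 * the VMM (no fault, or auto-resolved in async mode) and false if the
 * faulting thread would block until the VMM resolves the fault.
 */
static bool wss_touch(struct wss_region *r, size_t idx)
{
	assert(idx < r->nr);
	struct wss_page *pg = &r->pages[idx];

	if (pg->resident && !pg->deactivated)
		return true;
	if (pg->resident && r->mode == WSS_ASYNC) {
		pg->deactivated = false;	/* page is hot again */
		return true;
	}
	return false;
}

/* Models PAGEMAP_SCAN: report indices of still-deactivated pages. */
static size_t wss_scan_cold(const struct wss_region *r, size_t *out,
			    size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < r->nr && n < max; i++)
		if (r->pages[i].resident && r->pages[i].deactivated)
			out[n++] = i;
	return n;
}

/*
 * Models pwrite + MADV_DONTNEED on the scanned pages. Refuses to run in
 * async mode, where an access could reactivate a page mid-eviction.
 * Returns the number of pages evicted.
 */
static size_t wss_evict(struct wss_region *r, const size_t *cold, size_t n)
{
	size_t evicted = 0;

	if (r->mode != WSS_SYNC)
		return 0;
	for (size_t i = 0; i < n; i++) {
		struct wss_page *pg = &r->pages[cold[i]];

		if (pg->resident && pg->deactivated) {
			pg->resident = false;
			pg->deactivated = false;
			evicted++;
		}
	}
	return evicted;
}
```

In this model, pages touched between wss_deactivate_all() and
wss_scan_cold() drop out of the cold set, and wss_evict() only makes
progress after the mode has been switched to WSS_SYNC, mirroring the
async-detect, sync-evict split described above.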

== NUMA Balancing ==

NUMA balancing scanning is skipped on anonymous VM_UFFD_MINOR VMAs to
avoid protnone conflicts. NUMA locality stats are fed from the uffd
fault path via task_numa_fault() so the scheduler retains placement
data. Shmem VMAs are unaffected (UFFDIO_DEACTIVATE zaps PTEs there,
no protnone involved).

== Testing ==

The series includes 6 new selftests covering async/sync modes,
PAGEMAP_SCAN cold detection, GUP through protnone, UFFDIO_SET_MODE
toggling, and cleanup on close. All 73 uffd unit tests pass
(including hugetlb) across defconfig, allnoconfig, allmodconfig,
and randomized configs.

Kiryl Shutsemau (Meta) (12):
  userfaultfd: define UAPI constants for anonymous minor faults
  userfaultfd: add UFFD_FEATURE_MINOR_ANON registration support
  userfaultfd: implement UFFDIO_DEACTIVATE ioctl
  userfaultfd: UFFDIO_CONTINUE for anonymous memory
  mm: intercept protnone faults on VM_UFFD_MINOR anonymous VMAs
  userfaultfd: auto-resolve shmem and hugetlbfs minor faults in async
    mode
  sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
  userfaultfd: enable UFFD_FEATURE_MINOR_ANON
  mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN
  userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
  selftests/mm: add userfaultfd anonymous minor fault tests
  Documentation/userfaultfd: document working set tracking

 Documentation/admin-guide/mm/userfaultfd.rst | 141 ++++-
 fs/proc/task_mmu.c                           |  11 +-
 fs/userfaultfd.c                             | 184 +++++-
 include/linux/huge_mm.h                      |   6 +
 include/linux/mm.h                           |   2 +
 include/linux/sched/numa_balancing.h         |   1 +
 include/linux/userfaultfd_k.h                |  21 +-
 include/trace/events/sched.h                 |   3 +-
 include/uapi/linux/fs.h                      |   1 +
 include/uapi/linux/userfaultfd.h             |  40 +-
 kernel/sched/fair.c                          |  13 +
 mm/huge_memory.c                             |  33 +-
 mm/hugetlb.c                                 |   3 +-
 mm/memory.c                                  |  51 +-
 mm/mprotect.c                                |   9 +-
 mm/shmem.c                                   |   3 +-
 mm/userfaultfd.c                             | 164 +++++-
 tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++
 18 files changed, 1096 insertions(+), 48 deletions(-)

--
2.51.2