[PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB

From: Jiaqi Yan <jiaqiyan@google.com>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
	william.roche@oracle.com,  harry.yoo@oracle.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com,
	willy@infradead.org,  jane.chu@oracle.com,
	akpm@linux-foundation.org, osalvador@suse.de,
	 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
	 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
	 sidhartha.kumar@oracle.com, ziy@nvidia.com, david@redhat.com,
	 dave.hansen@linux.intel.com, muchun.song@linux.dev,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,  Jiaqi Yan <jiaqiyan@google.com>
Subject: [PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB
Date: Sun, 16 Nov 2025 01:32:20 +0000
Message-ID: <20251116013223.1557158-1-jiaqiyan@google.com>

Problem
=======

This patchset is a follow-up to the userspace memory failure
recovery (MFR) policy proposed in [1] and [2], but focused on
a smaller scope: HugeTLB.

To recap the problem for HugeTLB discussed in [1] and [2]:
Cloud providers like Google and Oracle usually serve capacity-
and performance-critical guest memory with 1G HugeTLB
hugepages, as this significantly reduces the overhead
associated with managing page tables and TLB misses. However,
the kernel's current MFR behavior for HugeTLB is not ideal.
Once a byte of memory in a hugepage is hardware corrupted, the
kernel discards the whole hugepage, including the healthy
portion, from the HugeTLB system. Customer workloads running in
the VM can hardly recover from such a great loss of memory.

[1] and [2] proposed the idea that the decision to keep or
discard a large chunk of contiguous memory exclusively owned
by a userspace process due to a recoverable uncorrected
memory error (UE) should be controlled by userspace. What this
means in the Cloud case is that, since a virtual machine
monitor (VMM) has taken host memory to exclusively back the
guest memory for a VM, the VMM can keep holding the memory
even after memory errors occur.

MFD_MF_KEEP_UE_MAPPED for HugeTLB
=================================

[2] proposed a solution centered around the memfd associated
with the memory exclusively owned by userspace.

A userspace process must opt into the MFD_MF_KEEP_UE_MAPPED
policy when it creates a new HugeTLB-backed memfd:

  #define MFD_MF_KEEP_UE_MAPPED	0x0020U
  int memfd_create(const char *name, unsigned int flags);
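
As a minimal userspace sketch of the opt-in, assuming a kernel with this
series applied: the MFD_MF_KEEP_UE_MAPPED value is the one quoted above,
while keep_ue_flags() and create_keep_ue_memfd() are hypothetical helper
names used only for illustration. The fallback defines cover libc headers
that lack the constants.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef MFD_HUGE_1GB
#define MFD_HUGE_1GB (30U << 26)
#endif
/* Value proposed in this cover letter; absent from released headers. */
#ifndef MFD_MF_KEEP_UE_MAPPED
#define MFD_MF_KEEP_UE_MAPPED 0x0020U
#endif

/* Build the memfd_create() flags for a HugeTLB memfd that opts in. */
unsigned int keep_ue_flags(int use_1g_pages)
{
	unsigned int flags = MFD_CLOEXEC | MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED;

	if (use_1g_pages)
		flags |= MFD_HUGE_1GB;
	return flags;
}

/*
 * Hypothetical helper: returns the new fd, or -errno on failure.
 * A kernel without this series is expected to reject the unknown
 * flag (typically with EINVAL), so callers should handle that case.
 */
int create_keep_ue_memfd(const char *name, int use_1g_pages)
{
	int fd;

	assert(name != NULL);
	fd = memfd_create(name, keep_ue_flags(use_1g_pages));
	return fd < 0 ? -errno : fd;
}
```

A VMM would typically call create_keep_ue_memfd("guest-mem", 1), then
ftruncate() and mmap() the returned fd as usual; falling back to a memfd
without the flag on -EINVAL keeps the code usable on unpatched kernels.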

For any hugepage associated with a memfd that has
MFD_MF_KEEP_UE_MAPPED enabled, MFR does not hard offline the
HWPoison-ed huge folio when the hugepage runs into a UE. In
other words, the HWPoison-ed memory remains accessible via the
returned memfd or the memory mapping created with that memfd.
MFR still sends SIGBUS to the userspace process as required,
and still maintains HWPoison metadata on the hugepage having
the UE.

A HWPoison-ed hugepage will be immediately isolated and
prevented from future allocation once userspace truncates it
via the memfd, or the owning memfd is closed.

By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard
offlines hugepages having UEs.

Implementation
==============

The implementation is relatively straightforward and has two major parts.

Part 1: When hugepages owned by an MFD_MF_KEEP_UE_MAPPED
enabled memfd run into a UE:

* MFR defers hard offline operations, i.e., unmapping and
  dissolving. MFR still sets HWPoison flags and holds a refcount
  for every raw HWPoison-ed page. MFR still sends SIGBUS to the
  consuming thread, but si_addr_lsb will be reduced to PAGE_SHIFT.
* If the memory has not been faulted in yet, the fault handler
  must also stop blocking faults on the HWPoison-ed folio.
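
To illustrate what the reduced si_addr_lsb means for the SIGBUS consumer,
here is a hedged userspace sketch that derives the affected range from
siginfo_t. The struct and function names are hypothetical; si_addr,
si_addr_lsb, BUS_MCEERR_AR and BUS_MCEERR_AO are the existing signal API.
With si_addr_lsb equal to PAGE_SHIFT, the computed range covers a single
base page instead of the whole hugepage.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: the address range reported as poisoned. */
struct poison_range {
	uintptr_t start;
	size_t len;
};

/* The range is 2^lsb bytes long and aligned to its own size. */
struct poison_range poison_range_from(uintptr_t addr, unsigned int lsb)
{
	struct poison_range r;

	assert(lsb < sizeof(uintptr_t) * 8);
	r.len = (size_t)1 << lsb;
	r.start = addr & ~((uintptr_t)r.len - 1);
	return r;
}

/*
 * Fill *out from a SIGBUS siginfo. Returns 0 for memory-error
 * SIGBUS (BUS_MCEERR_AR / BUS_MCEERR_AO), -1 for anything else.
 */
int poison_range_from_siginfo(const siginfo_t *info, struct poison_range *out)
{
	if (info->si_signo != SIGBUS)
		return -1;
	if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
		return -1;
	*out = poison_range_from((uintptr_t)info->si_addr,
				 (unsigned int)info->si_addr_lsb);
	return 0;
}
```

A VMM's SA_SIGINFO handler could pass its siginfo_t to
poison_range_from_siginfo() and forward only the resulting range to the
guest as poisoned.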

Part 2: When an MFD_MF_KEEP_UE_MAPPED enabled memfd is being
released, or when a userspace process truncates a range of
hugepages belonging to an MFD_MF_KEEP_UE_MAPPED enabled memfd:

* When the HugeTLB in-memory file system removes a filemap's
  folios one by one, it asks MFR to deal with HWPoison-ed folios
  on the fly, implemented by filemap_offline_hwpoison_folio().

* MFR drops the refcounts being held for the raw HWPoison-ed
  pages within the folio. Because the HWPoison-ed folio is now
  a free HugeTLB folio, MFR dissolves it into a set of raw pages.
  dissolve_free_hugetlb_folio() frees them all to the buddy
  allocator, including the HWPoison-ed raw pages. So MFR also
  needs to take these HWPoison-ed pages off the buddy allocator.
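
From userspace, giving up a single poisoned hugepage might look like the
sketch below. It assumes the mapping starts at file offset 0 of the memfd
and that punching a hole with fallocate() goes through the same HugeTLB
folio removal path as truncation; both are assumptions of this example,
and the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * File offset of the hugepage containing fault_addr, assuming the
 * mapping at map_base corresponds to file offset 0.
 */
off_t hugepage_offset(uintptr_t map_base, uintptr_t fault_addr,
		      size_t hugepage_size)
{
	uintptr_t delta;

	assert(hugepage_size != 0);
	assert(fault_addr >= map_base);
	delta = fault_addr - map_base;
	return (off_t)(delta - delta % hugepage_size);
}

/*
 * Punch out the one hugepage that contains fault_addr while keeping
 * the file size. Returns 0 on success, -1 with errno set on failure.
 */
int release_poisoned_hugepage(int fd, uintptr_t map_base,
			      uintptr_t fault_addr, size_t hugepage_size)
{
	off_t off = hugepage_offset(map_base, fault_addr, hugepage_size);

	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, (off_t)hugepage_size);
}
```

Closing the memfd (and unmapping it) releases every remaining hugepage,
poisoned or not, so an explicit call like this is only needed when the
process wants to drop one hugepage and keep the rest.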

One thing worth noting, as pointed out by William Roche:
during the time window between freeing to the buddy allocator
and taking off the buddy allocator, a high-order folio with
HWPoison-ed subpages can be allocated. This race already
exists today, after the buddy allocator reduced its sanity
checks [3]. With MFD_MF_KEEP_UE_MAPPED, multiple raw HWPoison-ed
pages can be allocated. Since MFD_MF_KEEP_UE_MAPPED could
exacerbate the issue, I have proposed a solution [4] based on
discussion with Harry Yoo and Miaohe Lin, and will send it out
as a separate formal patchset.

Changelog
=========

v1 [2] -> v2
- Rebased onto commit 6da43bbeb6918 ("Merge tag 'vfio-v6.18-rc6' of
  https://github.com/awilliam/linux-vfio").
- Removed populate_memfd_hwp_folios and offline_memfd_hwp_folios so
  that no memory allocation is needed when releasing a HWPoison-ed
  memfd.
- Inserted filemap_offline_hwpoison_folio into remove_inode_single_folio.
  Now dissolving and offlining HWPoison-ed huge folios is done on the fly.
- Fixed the bug pointed out by William Roche <william.roche@oracle.com>:
  call take_page_off_buddy regardless of whether the HWPoison-ed page
  is a buddy page.
- Removed the update_per_node_mf_stats call made when dissolve fails.
- Made hugetlb-mfr allocate four 1G hugepages to cover new code introduced
  in remove_inode_hugepages.
- Made hugetlb-mfr support testing both 1GB and 2MB HugeTLB hugepages.
- Fixed some typos in documentation.

[1] https://lwn.net/Articles/991513
[2] https://lore.kernel.org/lkml/20250118231549.1652825-1-jiaqiyan@google.com
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://lore.kernel.org/lkml/CACw3F51VGxg4q9nM_eQN7OXs7JaZo9K-nvDwxtZgtjFSNyjQaw@mail.gmail.com

Jiaqi Yan (3):
  mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  selftests/mm: test userspace MFR for HugeTLB hugepage
  Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED

 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/mfd_mfr_policy.rst          |  60 ++++
 fs/hugetlbfs/inode.c                          |  25 +-
 include/linux/hugetlb.h                       |   7 +
 include/linux/pagemap.h                       |  24 ++
 include/uapi/linux/memfd.h                    |   6 +
 mm/hugetlb.c                                  |  20 +-
 mm/memfd.c                                    |  15 +-
 mm/memory-failure.c                           | 124 ++++++-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 tools/testing/selftests/mm/hugetlb-mfr.c      | 327 ++++++++++++++++++
 12 files changed, 592 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
 create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c

-- 
2.52.0.rc1.455.g30608eb744-goog