From: "“William Roche" <william.roche@oracle.com>
To: jiaqiyan@google.com, jgg@nvidia.com
Cc: akpm@linux-foundation.org, ankita@nvidia.com,
dave.hansen@linux.intel.com, david@redhat.com,
duenwen@google.com, jane.chu@oracle.com, jthoughton@google.com,
linmiaohe@huawei.com, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
muchun.song@linux.dev, nao.horiguchi@gmail.com,
osalvador@suse.de, peterx@redhat.com, rientjes@google.com,
sidhartha.kumar@oracle.com, tony.luck@intel.com,
wangkefeng.wang@huawei.com, willy@infradead.org,
harry.yoo@oracle.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
Date: Fri, 19 Sep 2025 15:58:32 +0000
Message-ID: <20250919155832.1084091-1-william.roche@oracle.com>
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>

Hello,

The ability to keep a VM backed by large hugetlbfs pages running after a
memory error is very important, and the mechanism described here looks like a
good candidate to address this issue.

So I would like to share my feedback after testing this code with persistent
errors introduced into the address space. My tests used a VM running a kernel
able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the test program
provided with this project. But instead of injecting the errors with madvise()
calls from this program, I take the guest physical address of a location and
inject the error from the hypervisor into the VM, so that any subsequent
access to the location is blocked directly at the hypervisor level.

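For illustration, this kind of injection can be done roughly like this (the
addresses are made up; gpa2hpa is a QEMU monitor command, and corrupt-pfn is
the debugfs file provided by the hwpoison-inject module):

  (qemu) gpa2hpa 0x12345000
  Host physical address for 0x12345000 (pc.ram) is 0x467890000

  # then, on the host, poison the backing page frame:
  modprobe hwpoison-inject
  echo 0x467890 > /sys/kernel/debug/hwpoison/corrupt-pfn
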
Using this framework, I realized that the code provided here has a problem:
when the error impacts a large folio, the release of this folio doesn't
isolate the sub-page(s) actually impacted by the poison, so
__rmqueue_pcplist() can return a known-poisoned page to
get_page_from_freelist().

This revealed a wider mm limitation: I would have expected the
check_new_pages() mechanism used by the __rmqueue functions to filter such
pages out, but this check has been disabled by default since 2023 with:

  [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
  https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
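
For reference, here is a condensed sketch (not the verbatim kernel code) of
the filtering I had in mind; after the patch above, if I read it correctly, it
only runs when the check_pages_enabled static key is turned on (e.g. by debug
page allocation options):

  /*
   * Condensed sketch of the alloc-time sanity check in mm/page_alloc.c
   * (not verbatim): each page of the block taken off the free list is
   * validated before being handed out, and a block containing a page
   * with __PG_HWPOISON set is rejected.
   */
  static bool check_new_pages(struct page *page, unsigned int order)
  {
          for (unsigned int i = 0; i < (1 << order); i++) {
                  struct page *p = page + i;

                  if (unlikely(p->flags & __PG_HWPOISON))
                          return true;    /* bad block: don't allocate it */
                  if (check_new_page(p))
                          return true;
          }
          return false;
  }
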
This problem seems to be avoided if we call take_page_off_buddy(page) in the
filemap_offline_hwpoison_folio_hugetlb() function without first testing
whether PageBuddy(page) is true.

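In diff form, the workaround I tested looks roughly like this (I am guessing
at the exact shape of the RFC code around it, but PageBuddy() and
take_page_off_buddy() are the existing mm helpers):

  -	if (PageBuddy(page))
  -		take_page_off_buddy(page);
  +	/*
  +	 * Only the head page of a free buddy block has PageBuddy() set,
  +	 * so a poisoned sub-page sitting inside a larger free block
  +	 * fails the test while remaining allocatable.
  +	 * take_page_off_buddy() already searches every order for the
  +	 * block containing the page, so call it unconditionally.
  +	 */
  +	take_page_off_buddy(page);
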
But in my view this still leaves a (small) race window where a new page
allocation could pick up the poisoned sub-page between the dissolve phase and
the attempt to remove it from the buddy allocator.

My impression is that correct behavior with large pages (isolating the
impacted sub-page and remapping the valid memory content) is currently only
achieved with Transparent Huge Pages.

If performance requires using hugetlb pages, then maybe we could accept
losing an entire huge page once a memory-error-impacted MFD_MF_KEEP_UE_MAPPED
memfd segment is released, if that easily avoids other corruption?

I'm very interested in finding an appropriate way to deal with memory errors
on hugetlbfs pages, and willing to help build a valid solution. This project
showed a real possibility to do so, even in cases where pinned memory is
used, with VFIO for example.

I would really be interested in your feedback about this project, and if
another solution is considered better suited to deal with errors on hugetlbfs
pages, please let us know.

Thanks in advance for your answers.

William.

Thread overview: 23+ messages
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd Jiaqi Yan
2025-01-20 17:26 ` [RFC PATCH v1 0/3] Userspace MFR Policy " Jason Gunthorpe
2025-01-21 21:45 ` Jiaqi Yan
2025-01-22 16:41 ` Zi Yan
2025-09-19 15:58 ` William Roche [this message]
2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
2025-10-28 4:17 ` Jiaqi Yan
2025-10-22 13:09 ` Harry Yoo
2025-10-28 4:17 ` Jiaqi Yan
2025-10-28 7:00 ` Harry Yoo
2025-10-30 11:51 ` Miaohe Lin
2025-10-30 17:28 ` Jiaqi Yan
2025-10-30 21:28 ` Jiaqi Yan
2025-11-03 8:16 ` Harry Yoo
2025-11-03 8:53 ` Harry Yoo
2025-11-03 16:57 ` Jiaqi Yan
2025-11-04 3:44 ` Miaohe Lin
2025-11-06 7:53 ` Harry Yoo
2025-11-12 1:28 ` Jiaqi Yan