From: Jiaqi Yan <jiaqiyan@google.com>
To: jackmanb@google.com, hannes@cmpxchg.org, linmiaohe@huawei.com,
ziy@nvidia.com, harry.yoo@oracle.com, willy@infradead.org
Cc: nao.horiguchi@gmail.com, david@redhat.com,
lorenzo.stoakes@oracle.com, william.roche@oracle.com,
tony.luck@intel.com, wangkefeng.wang@huawei.com,
jane.chu@oracle.com, akpm@linux-foundation.org,
osalvador@suse.de, muchun.song@linux.dev, rientjes@google.com,
duenwen@google.com, jthoughton@google.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, Jiaqi Yan <jiaqiyan@google.com>
Subject: [PATCH v3 0/3] Only free healthy pages in high-order has_hwpoisoned folio
Date: Mon, 12 Jan 2026 00:49:20 +0000
Message-ID: <20260112004923.888429-1-jiaqiyan@google.com>
At the end of dissolve_free_hugetlb_folio(), when a free HugeTLB
folio becomes non-HugeTLB, it is released to the buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.
This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since the buddy allocator does not check
HWPoison for non-zero-order folios, the raw HWPoison page can
be given out with its buddy page and be re-used by either the
kernel or userspace.
Memory failure recovery (MFR) in the kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio(). However, there is always a time
window between dissolve_free_hugetlb_folio() freeing a HWPoison
high-order folio to the buddy allocator and MFR taking the
HWPoison raw page off the buddy allocator.
One obvious way to avoid this problem is to add page sanity
checks in the page allocation or free path. However, that goes
against past efforts to reduce sanity check overhead [1,2,3].
Introduce free_has_hwpoisoned() to "salvage" the healthy pages
and exclude the HWPoison ones in the high-order folio.
free_has_hwpoisoned() happens after free_pages_prepare(),
which already deals with both decomposing the original compound
page and updating page metadata like alloc tag and page owner.
Its idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages. Instead of freeing
pages one by one, decompose healthy ranges into the largest
possible blocks. Each block is freed via free_one_page() directly.
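
The decomposition described above can be modeled in userspace. The
following is only a sketch, not the code in this series: it assumes
poison state is given as a bool array rather than PageHWPoison(), it
assumes each block must be naturally aligned to its own size, and
record_free(), free_healthy_range() and free_skipping_poison() are
hypothetical names, with record_free() standing in for free_one_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 64

/* One freed block: first page index and buddy order (2^order pages). */
struct block {
	size_t start;
	unsigned int order;
};

struct block freed[MAX_BLOCKS];
size_t nr_freed;

/* Hypothetical stand-in for free_one_page(): just record the block. */
void record_free(size_t start, unsigned int order)
{
	assert(nr_freed < MAX_BLOCKS);
	freed[nr_freed].start = start;
	freed[nr_freed].order = order;
	nr_freed++;
}

/* Free the healthy range [start, end) as the largest aligned blocks. */
void free_healthy_range(size_t start, size_t end, unsigned int max_order)
{
	while (start < end) {
		unsigned int order = max_order;

		/* Shrink until the block is naturally aligned and fits. */
		while (order > 0 &&
		       ((start & (((size_t)1 << order) - 1)) != 0 ||
			start + ((size_t)1 << order) > end))
			order--;

		record_free(start, order);
		start += (size_t)1 << order;
	}
}

/* Linear scan: find each run of healthy pages and free it in blocks. */
void free_skipping_poison(const bool *poisoned, size_t nr_pages,
			  unsigned int max_order)
{
	size_t i = 0;

	while (i < nr_pages) {
		size_t start;

		if (poisoned[i]) {
			i++;	/* poisoned pages are never freed */
			continue;
		}
		start = i;
		while (i < nr_pages && !poisoned[i])
			i++;
		free_healthy_range(start, i, max_order);
	}
}
```

For a 16-page toy folio with only page 5 poisoned and max_order 4,
this model frees four blocks covering the 15 healthy pages: pages 0-3
(order 2), page 4 (order 0), pages 6-7 (order 1) and pages 8-15
(order 3).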
free_has_hwpoisoned() has linear time complexity with respect to
the number of pages in the folio. While the power-of-two
decomposition ensures that the number of calls to the buddy
allocator is logarithmic for each contiguous healthy range, the
mandatory linear scan of pages to identify PageHWPoison defines
the overall time complexity.
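
As a rough worked bound, assuming a greedy decomposition into the
largest naturally aligned blocks (an assumption about the algorithm,
not a statement about the exact code in this series): block orders
first strictly increase while alignment is the limit and then strictly
decrease while the remaining size is the limit, so each order is used
at most twice per healthy range. The symbols n (pages in one healthy
range), N (pages in the folio), k (number of HWPoison pages) and C
(number of blocks freed) are introduced here for illustration only.

```latex
\[
  C(n) \le 2\left(\lfloor \log_2 n \rfloor + 1\right),
  \qquad
  C_{\text{total}} \le (k + 1) \cdot 2\left(\lfloor \log_2 N \rfloor + 1\right)
\]
% Example: a 1G folio with 4K base pages has N = 2^{18} = 262144.
% With k = 8 HWPoison pages there are at most 9 healthy ranges, so
% C_total <= 9 * 2 * (18 + 1) = 342 block frees,
% while the scan still visits all 262144 pages.
```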
I tested with some test-only code [4] and hugetlb-mfr [5], by
checking the status of pcplist and freelist immediately after
dissolve_free_hugetlb_folio() dissolves a free 2M or 1G hugetlb
page that contains 1~8 HWPoison raw pages:
- HWPoison pages are excluded by free_has_hwpoisoned().
- Some healthy pages can be in zone->per_cpu_pageset (pcplist)
because pcp_count is not high enough. Many healthy pages are
in some order's zone->free_area[order].free_list (freelist).
- In rare cases, some healthy pages are in neither pcplist
nor freelist. My best guess is they are allocated before
the test checks.
To illustrate the latency that free_has_hwpoisoned() adds to the
memory freeing path, I measured its time cost with 8 HWPoison
pages, using the instrumentation code in [4], over 20 sample runs:
- Has HWPoison path: mean=2.02ms, stdev=0.14ms
- No HWPoison path: mean=66us, stdev=6us
free_has_hwpoisoned() is around 30x the baseline. It is far from
triggering a soft lockup, and the cost is fair for handling
exceptional hardware memory errors.
Given this nontrivial overhead, checking PG_has_hwpoisoned, doing
normal free_pages_prepare(), and doing free_has_hwpoisoned() when
necessary are wrapped in free_pages_prepare_has_hwpoisoned(), which
replaces free_pages_prepare() calls in free_frozen_pages().
With free_has_hwpoisoned() ensuring HWPoison pages never make it
into the buddy allocator, MFR doesn't need to take_page_off_buddy()
anymore after dissolving HWPoison hugepages. So refactor
page_handle_poison() to remove take_page_off_buddy() in case of
hugepage, but still take_page_off_buddy() in case of free buddy page.
Based on commit ccd1cdca5cd4 ("Merge tag 'nfsd-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")
Changelog
v2 [7] -> v3:
- Address comments from Matthew Wilcox, Harry Yoo, Miaohe Lin.
- Let free_has_hwpoisoned() happen after free_pages_prepare(),
which helps to deal with decomposing the original compound page,
and with page metadata like alloc tag and page owner.
- Tested with "page_owner=on" and CONFIG_MEM_ALLOC_PROFILING*=y.
- Wrap checking PG_has_hwpoisoned and free_has_hwpoisoned() into
free_pages_prepare_has_hwpoisoned(), which replaces
free_pages_prepare() calls in free_frozen_pages().
- Rename free_has_hwpoison_page() to free_has_hwpoisoned().
- Measure latency added by free_has_hwpoisoned().
- Ensure struct page *end is only used for pointer arithmetic,
instead of accessed as page.
- Refactor page_handle_poison() instead of just __page_handle_poison().
v1 [6] -> v2:
- Total reimplementation based on discussions with Matthew Wilcox,
Harry Yoo, Zi Yan etc.
- hugetlb_free_hwpoison_folio => free_has_hwpoison_pages.
- Utilize has_hwpoisoned flag to tell buddy allocator a high-order
folio contains HWPoison.
- Simplify __page_handle_poison given that the HWPoison page(s)
won't be freed within high-order folio.
[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://drive.google.com/file/d/1CzJn1Cc4wCCm183Y77h244fyZIkTLzCt/view?usp=sharing
[5] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com
[6] https://lore.kernel.org/linux-mm/20251116014721.1561456-1-jiaqiyan@google.com
[7] https://lore.kernel.org/linux-mm/20251219183346.3627510-1-jiaqiyan@google.com
Jiaqi Yan (3):
  mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio
  mm/page_alloc: only free healthy pages in high-order has_hwpoisoned
    folio
  mm/memory-failure: refactor page_handle_poison()

 include/linux/page-flags.h |   2 +-
 mm/memory-failure.c        |  85 ++++++++++----------
 mm/page_alloc.c            | 157 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 197 insertions(+), 47 deletions(-)
--
2.52.0.457.g6b5491de43-goog