linux-mm.kvack.org archive mirror
* [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
@ 2026-04-15  2:23 Muchun Song
  2026-04-15  5:44 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Muchun Song @ 2026-04-15  2:23 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Oscar Salvador, Charan Teja Kalla
  Cc: Muchun Song, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	linux-cxl

When memory is hot-removed, section_deactivate() can tear down
mem_section->usage while concurrent pfn walkers still inspect the
subsection map via pfn_section_valid() or pfn_section_first_valid().

After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") converted the teardown to an RCU-based
scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
to readers before ms->usage is cleared and queued for freeing.

That ordering is not guaranteed. section_deactivate() can clear
ms->usage and queue kfree_rcu() before another CPU observes the
SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
valid_section() return true, enter its sched-RCU read-side critical
section after kfree_rcu() has already been queued, and then dereference
a stale ms->usage pointer.

In addition, pfn_to_online_page() can call pfn_section_valid() without
its own sched-RCU read-side critical section, which has a similar problem.

The race looks like this:

  compact_zone()                    memunmap_pages()
  ==============                    ==============
                                    __remove_pages()->
                                      sparse_remove_section()->
                                        section_deactivate():
                                          a) [ Clear SECTION_HAS_MEM_MAP
                                               is reordered to b) ]
                                          kfree_rcu(ms->usage)
      __pageblock_pfn_to_page
         ......
          pfn_valid():
            rcu_read_lock_sched()
            valid_section() // return true
            pfn_section_valid()
              [Access ms->usage which is UAF]
                                          WRITE_ONCE(ms->usage, NULL)
            rcu_read_unlock_sched()       b) Clear SECTION_HAS_MEM_MAP

Fix this by using rcu_replace_pointer() to clear ms->usage in
section_deactivate(). The usage pointer is then retracted with an
RCU-annotated store before being handed to kfree_rcu(), so correctness
no longer depends on the ordering of the SECTION_HAS_MEM_MAP clear.

Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
This patch is focused on the ms->usage lifetime race only.

One open question is the interaction between pfn_to_online_page() and
vmemmap teardown during memory hot-remove.

Could pfn_to_online_page() still hand out a stale struct page here?

The new sched-RCU critical section ends before pfn_to_page(pfn), but
section_deactivate() can still tear the vmemmap down immediately afterwards:

mm/sparse-vmemmap.c:section_deactivate()
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
    usage = rcu_replace_pointer(ms->usage, NULL, true);
    kfree_rcu(usage, rcu);
    depopulate_section_memmap(...);

That suggests a reader can observe valid == true, leave its sched-RCU
read-side critical section, race with section_deactivate(), and then
execute pfn_to_page(pfn) after the backing vmemmap has been depopulated.

Callers such as mm/compaction.c:__reset_isolation_pfn(),
mm/page_idle.c:page_idle_get_folio(), and fs/proc/kcore.c:read_kcore_iter()
dereference the returned page immediately, and they do not appear to hold
get_online_mems() across the pfn_to_online_page() call.

I am not fully sure whether that reasoning is correct, or whether current
callers are expected to rely on additional hotplug serialization instead.
Comments on whether this is a real issue, and how the vmemmap lifetime is
expected to be handled here, would be very helpful.
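
If the answer is that callers must pin hotplug themselves, the pattern
would presumably look like the following (a hypothetical sketch only;
the callers listed above do not currently do this):

```c
	/*
	 * Hypothetical caller pattern: hold the memory-hotplug read side
	 * across both the lookup and the dereference, so that
	 * section_deactivate() cannot depopulate the vmemmap in between.
	 */
	struct page *page;

	get_online_mems();
	page = pfn_to_online_page(pfn);
	if (page) {
		/* ... inspect the struct page ... */
	}
	put_online_mems();
```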
---
 include/linux/mmzone.h | 6 +++---
 mm/memory_hotplug.c    | 6 +++++-
 mm/sparse-vmemmap.c    | 6 ++++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 238bf2d35a54..0e850924cbeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2014,7 +2014,7 @@ struct mem_section {
 	 */
 	unsigned long section_mem_map;
 
-	struct mem_section_usage *usage;
+	struct mem_section_usage __rcu *usage;
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -2178,14 +2178,14 @@ static inline int subsection_map_index(unsigned long pfn)
 static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 {
 	int idx = subsection_map_index(pfn);
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 
 	return usage ? test_bit(idx, usage->subsection_map) : 0;
 }
 
 static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
 {
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 	int idx = subsection_map_index(*pfn);
 	unsigned long bit;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0c1d3df3a296..335835abe74c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -340,6 +340,7 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	unsigned long nr = pfn_to_section_nr(pfn);
 	struct dev_pagemap *pgmap;
 	struct mem_section *ms;
+	bool valid;
 
 	if (nr >= NR_MEM_SECTIONS)
 		return NULL;
@@ -355,7 +356,10 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn))
 		return NULL;
 
-	if (!pfn_section_valid(ms, pfn))
+	rcu_read_lock_sched();
+	valid = pfn_section_valid(ms, pfn);
+	rcu_read_unlock_sched();
+	if (!valid)
 		return NULL;
 
 	if (!online_device_section(ms))
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 717ac953bba2..05f68dcec0f8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -601,8 +601,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		 * was allocated during boot.
 		 */
 		if (!PageReserved(virt_to_page(ms->usage))) {
-			kfree_rcu(ms->usage, rcu);
-			WRITE_ONCE(ms->usage, NULL);
+			struct mem_section_usage *usage;
+
+			usage = rcu_replace_pointer(ms->usage, NULL, true);
+			kfree_rcu(usage, rcu);
 		}
 		memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
 	}
-- 
2.20.1



Thread overview: 3+ messages
2026-04-15  2:23 [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers Muchun Song
2026-04-15  5:44 ` Andrew Morton
2026-04-15  6:06   ` Muchun Song