linux-mm.kvack.org archive mirror
* [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
@ 2026-04-15  2:23 Muchun Song
  2026-04-15  5:44 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Muchun Song @ 2026-04-15  2:23 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Oscar Salvador, Charan Teja Kalla
  Cc: Muchun Song, Muchun Song, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	linux-cxl

When memory is hot-removed, section_deactivate() can tear down
mem_section->usage while concurrent pfn walkers still inspect the
subsection map via pfn_section_valid() or pfn_section_first_valid().

After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") converted the teardown to an RCU-based
scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
to readers before ms->usage is cleared and queued for freeing.

That ordering is not guaranteed. section_deactivate() can clear
ms->usage and queue kfree_rcu() before another CPU observes the
SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
valid_section() return true, enter its sched-RCU read-side critical
section after kfree_rcu() has already been queued, and then dereference
a stale ms->usage pointer.

In addition, pfn_to_online_page() calls pfn_section_valid() without its
own sched-RCU read-side critical section, which has the same problem.

The race looks like this:

  compact_zone()                    memunmap_pages
  ==============                    ==============
                                    __remove_pages()->
                                      sparse_remove_section()->
                                        section_deactivate():
                                          a) [ Clear SECTION_HAS_MEM_MAP
                                               is reordered to b) ]
                                          kfree_rcu(ms->usage)
      __pageblock_pfn_to_page
         ......
          pfn_valid():
            rcu_read_lock_sched()
            valid_section() // return true
            pfn_section_valid()
              [Access ms->usage which is UAF]
                                          WRITE_ONCE(ms->usage, NULL)
            rcu_read_unlock_sched()       b) Clear SECTION_HAS_MEM_MAP

Fix this by using rcu_replace_pointer() to clear ms->usage in
section_deactivate(), so that readers no longer depend on the ordering
of the SECTION_HAS_MEM_MAP clear.

Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
This patch is focused on the ms->usage lifetime race only.

One open question is the interaction between pfn_to_online_page() and
vmemmap teardown during memory hot-remove.

Could pfn_to_online_page() still hand out a stale struct page here?

The new sched-RCU critical section ends before pfn_to_page(pfn), but
section_deactivate() can still tear the vmemmap down immediately afterwards:

mm/sparse-vmemmap.c:section_deactivate()
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
    usage = rcu_replace_pointer(ms->usage, NULL, true);
    kfree_rcu(usage, rcu);
    depopulate_section_memmap(...);

That looks like a reader can observe valid = true, drop sched-RCU, race with
section_deactivate(), and then execute pfn_to_page(pfn) after the backing
vmemmap was depopulated.

Callers such as mm/compaction.c:__reset_isolation_pfn(),
mm/page_idle.c:page_idle_get_folio(), and fs/proc/kcore.c:read_kcore_iter()
dereference the returned page immediately, and they do not appear to hold
get_online_mems() across the pfn_to_online_page() call.

I am not fully sure whether that reasoning is correct, or whether current
callers are expected to rely on additional hotplug serialization instead.
Comments on whether this is a real issue, and how the vmemmap lifetime is
expected to be handled here, would be very helpful.
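
For reference, the replace-then-free discipline the patch adopts can be
sketched as a userspace model (hypothetical names; C11 atomics standing in
for the kernel's RCU primitives — a sketch, not kernel code): the writer
atomically swaps NULL into the usage pointer and frees only the value it
got back, so the free can never be reordered ahead of the pointer becoming
unreachable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace stand-in for struct mem_section_usage. */
struct usage_model {
	unsigned long subsection_map[1];
};

/* Stand-in for ms->usage. */
static _Atomic(struct usage_model *) usage_ptr;

/*
 * Model of rcu_replace_pointer(ms->usage, NULL, true): publish NULL
 * with release ordering (as rcu_assign_pointer() would) and hand the
 * old pointer back to the caller, which alone may free it.
 */
static struct usage_model *usage_clear(void)
{
	return atomic_exchange_explicit(&usage_ptr, NULL,
					memory_order_release);
}
```

With this shape there is no window in which the pointer has been queued
for freeing while still reachable through ms->usage.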
---
 include/linux/mmzone.h | 6 +++---
 mm/memory_hotplug.c    | 6 +++++-
 mm/sparse-vmemmap.c    | 6 ++++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 238bf2d35a54..0e850924cbeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2014,7 +2014,7 @@ struct mem_section {
 	 */
 	unsigned long section_mem_map;
 
-	struct mem_section_usage *usage;
+	struct mem_section_usage __rcu *usage;
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -2178,14 +2178,14 @@ static inline int subsection_map_index(unsigned long pfn)
 static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 {
 	int idx = subsection_map_index(pfn);
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 
 	return usage ? test_bit(idx, usage->subsection_map) : 0;
 }
 
 static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
 {
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 	int idx = subsection_map_index(*pfn);
 	unsigned long bit;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0c1d3df3a296..335835abe74c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -340,6 +340,7 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	unsigned long nr = pfn_to_section_nr(pfn);
 	struct dev_pagemap *pgmap;
 	struct mem_section *ms;
+	bool valid;
 
 	if (nr >= NR_MEM_SECTIONS)
 		return NULL;
@@ -355,7 +356,10 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn))
 		return NULL;
 
-	if (!pfn_section_valid(ms, pfn))
+	rcu_read_lock_sched();
+	valid = pfn_section_valid(ms, pfn);
+	rcu_read_unlock_sched();
+	if (!valid)
 		return NULL;
 
 	if (!online_device_section(ms))
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 717ac953bba2..05f68dcec0f8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -601,8 +601,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		 * was allocated during boot.
 		 */
 		if (!PageReserved(virt_to_page(ms->usage))) {
-			kfree_rcu(ms->usage, rcu);
-			WRITE_ONCE(ms->usage, NULL);
+			struct mem_section_usage *usage;
+
+			usage = rcu_replace_pointer(ms->usage, NULL, true);
+			kfree_rcu(usage, rcu);
 		}
 		memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
 	}
-- 
2.20.1



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
  2026-04-15  2:23 [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers Muchun Song
@ 2026-04-15  5:44 ` Andrew Morton
  2026-04-15  6:06   ` Muchun Song
  2026-04-15  8:04 ` David Hildenbrand (Arm)
  2026-04-15  8:37 ` Oscar Salvador
  2 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2026-04-15  5:44 UTC (permalink / raw)
  To: Muchun Song
  Cc: David Hildenbrand, Oscar Salvador, Charan Teja Kalla,
	Muchun Song, Kairui Song, Qi Zheng, Shakeel Butt, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	linux-cxl

On Wed, 15 Apr 2026 10:23:26 +0800 Muchun Song <songmuchun@bytedance.com> wrote:

> When memory is hot-removed, section_deactivate() can tear down
> mem_section->usage while concurrent pfn walkers still inspect the
> subsection map via pfn_section_valid() or pfn_section_first_valid().
> 
> After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
> memory_section->usage") converted the teardown to an RCU-based
> scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
> to readers before ms->usage is cleared and queued for freeing.
> 
> That ordering is not guaranteed. section_deactivate() can clear
> ms->usage and queue kfree_rcu() before another CPU observes the
> SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
> valid_section() return true, enter its sched-RCU read-side critical
> section after kfree_rcu() has already been queued, and then dereference
> a stale ms->usage pointer.

Then what happens?  Can it oops?

> In addition, pfn_to_online_page() calls pfn_section_valid() without its
> own sched-RCU read-side critical section, which has the same problem.
> 
> The race looks like this:
> 
>   compact_zone()                    memunmap_pages
>   ==============                    ==============
>                                     __remove_pages()->
>                                       sparse_remove_section()->
>                                         section_deactivate():
>                                           a) [ Clear SECTION_HAS_MEM_MAP
>                                                is reordered to b) ]
>                                           kfree_rcu(ms->usage)
>       __pageblock_pfn_to_page
>          ......
>           pfn_valid():
>             rcu_read_lock_sched()
>             valid_section() // return true
>             pfn_section_valid()
>               [Access ms->usage which is UAF]
>                                           WRITE_ONCE(ms->usage, NULL)
>             rcu_read_unlock_sched()       b) Clear SECTION_HAS_MEM_MAP
> 
> Fix this by using rcu_replace_pointer() to clear ms->usage in
> section_deactivate(), so that readers no longer depend on the ordering
> of the SECTION_HAS_MEM_MAP clear.
> 
> Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")

December 2023.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> This patch is focused on the ms->usage lifetime race only.
> 
> ...
>
> I am not fully sure whether that reasoning is correct, or whether current
> callers are expected to rely on additional hotplug serialization instead.
> Comments on whether this is a real issue, and how the vmemmap lifetime is
> expected to be handled here, would be very helpful.

Thanks.  Quite a bit for consideration.

> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -601,8 +601,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  		 * was allocated during boot.
>  		 */
>  		if (!PageReserved(virt_to_page(ms->usage))) {
> -			kfree_rcu(ms->usage, rcu);
> -			WRITE_ONCE(ms->usage, NULL);
> +			struct mem_section_usage *usage;
> +
> +			usage = rcu_replace_pointer(ms->usage, NULL, true);
> +			kfree_rcu(usage, rcu);
>  		}
>  		memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
>  	}

This part isn't applicable to 7.0 - it depends on material I've sent to
Linus for 7.1-rc1.

So for now I'll drop this into mm-unstable to get it some runtime
testing.  If people like this patch and we decide to proceed with it
then I can make it a hotfix for 7.1-rcX.  But the -stable people will
be wanting a backportable version of it, if we decide to backport.




* Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
  2026-04-15  5:44 ` Andrew Morton
@ 2026-04-15  6:06   ` Muchun Song
  0 siblings, 0 replies; 7+ messages in thread
From: Muchun Song @ 2026-04-15  6:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Muchun Song, David Hildenbrand, Oscar Salvador,
	Charan Teja Kalla, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	linux-cxl



> On Apr 15, 2026, at 13:44, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Wed, 15 Apr 2026 10:23:26 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> 
>> When memory is hot-removed, section_deactivate() can tear down
>> mem_section->usage while concurrent pfn walkers still inspect the
>> subsection map via pfn_section_valid() or pfn_section_first_valid().
>> 
>> After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
>> memory_section->usage") converted the teardown to an RCU-based
>> scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
>> to readers before ms->usage is cleared and queued for freeing.
>> 
>> That ordering is not guaranteed. section_deactivate() can clear
>> ms->usage and queue kfree_rcu() before another CPU observes the
>> SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
>> valid_section() return true, enter its sched-RCU read-side critical
>> section after kfree_rcu() has already been queued, and then dereference
>> a stale ms->usage pointer.
> 
> Then what happens?  Can it oops?

Probably not: struct mem_section_usage has no pointer members, so the
stale read never dereferences an embedded pointer. The UAF can instead
lead to an incorrect validity decision later on.
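
To illustrate (a hypothetical userspace model, not kernel code):
pfn_section_valid() boils down to a bitmap test, so a read from freed
memory only yields a wrong bit, and hence a wrong validity answer, rather
than a faulting pointer dereference:

```c
#include <assert.h>

#define BITS_PER_ULONG (8 * sizeof(unsigned long))

/* Userspace model of pfn_section_valid(): a plain bitmap test on the
 * subsection map; there is no pointer chasing that could fault. */
static int subsection_test(const unsigned long *map, unsigned int idx)
{
	return (map[idx / BITS_PER_ULONG] >> (idx % BITS_PER_ULONG)) & 1UL;
}
```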

> 
>> In addition, pfn_to_online_page() calls pfn_section_valid() without its
>> own sched-RCU read-side critical section, which has the same problem.
>> 
>> The race looks like this:
>> 
>>  compact_zone()                    memunmap_pages
>>  ==============                    ==============
>>                                    __remove_pages()->
>>                                      sparse_remove_section()->
>>                                        section_deactivate():
>>                                          a) [ Clear SECTION_HAS_MEM_MAP
>>                                               is reordered to b) ]
>>                                          kfree_rcu(ms->usage)
>>      __pageblock_pfn_to_page
>>         ......
>>          pfn_valid():
>>            rcu_read_lock_sched()
>>            valid_section() // return true
>>            pfn_section_valid()
>>              [Access ms->usage which is UAF]
>>                                          WRITE_ONCE(ms->usage, NULL)
>>            rcu_read_unlock_sched()       b) Clear SECTION_HAS_MEM_MAP
>> 
>> Fix this by using rcu_replace_pointer() to clear ms->usage in
>> section_deactivate(), so that readers no longer depend on the ordering
>> of the SECTION_HAS_MEM_MAP clear.
>> 
>> Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
> 
> December 2023.

The probability of the reordering is relatively low and, as mentioned
above, serious issues are unlikely to occur, so it is hard to discover.

Thanks,
Muchun.

> 
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>> This patch is focused on the ms->usage lifetime race only.
>> 
>> ...
>> 
>> I am not fully sure whether that reasoning is correct, or whether current
>> callers are expected to rely on additional hotplug serialization instead.
>> Comments on whether this is a real issue, and how the vmemmap lifetime is
>> expected to be handled here, would be very helpful.
> 
> Thanks.  Quite a bit for consideration.
> 
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -601,8 +601,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>  * was allocated during boot.
>>  */
>> 	if (!PageReserved(virt_to_page(ms->usage))) {
>> - 		kfree_rcu(ms->usage, rcu);
>> - 		WRITE_ONCE(ms->usage, NULL);
>> + 		struct mem_section_usage *usage;
>> +
>> + 		usage = rcu_replace_pointer(ms->usage, NULL, true);
>> + 		kfree_rcu(usage, rcu);
>> 	}
>> 	memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
>> }
> 
> This part isn't applicable to 7.0 - it depends on material I've sent to
> Linus for 7.1-rc1.
> 
> So for now I'll drop this into mm-unstable to get it some runtime
> testing.  If people like this patch and we decide to proceed with it
> then I can make it a hotfix for 7.1-rcX.  But the -stable people will
> be wanting a backportable version of it, if we decide to backport.





* Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
  2026-04-15  2:23 [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers Muchun Song
  2026-04-15  5:44 ` Andrew Morton
@ 2026-04-15  8:04 ` David Hildenbrand (Arm)
  2026-04-15  9:20   ` Muchun Song
  2026-04-15  8:37 ` Oscar Salvador
  2 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-15  8:04 UTC (permalink / raw)
  To: Muchun Song, Andrew Morton, Oscar Salvador, Charan Teja Kalla
  Cc: Muchun Song, Kairui Song, Qi Zheng, Shakeel Butt, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	linux-cxl

On 4/15/26 04:23, Muchun Song wrote:
> When memory is hot-removed, section_deactivate() can tear down
> mem_section->usage while concurrent pfn walkers still inspect the
> subsection map via pfn_section_valid() or pfn_section_first_valid().

I'll note that it's all racy either way: someone checking pfn_valid() /
pfn_to_online_page() can race with concurrent unplug.

We've known that for years; it's hard to fix; it never ever triggers :)

So is this really worth it, when we should in fact, work on protecting
the users of pfn_valid() / pfn_to_online_page() with rcu or similar?

-- 
Cheers,

David



* Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
  2026-04-15  2:23 [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers Muchun Song
  2026-04-15  5:44 ` Andrew Morton
  2026-04-15  8:04 ` David Hildenbrand (Arm)
@ 2026-04-15  8:37 ` Oscar Salvador
  2026-04-15  9:45   ` Muchun Song
  2 siblings, 1 reply; 7+ messages in thread
From: Oscar Salvador @ 2026-04-15  8:37 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Charan Teja Kalla, Muchun Song,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-cxl

On Wed, Apr 15, 2026 at 10:23:26AM +0800, Muchun Song wrote:
> When memory is hot-removed, section_deactivate() can tear down
> mem_section->usage while concurrent pfn walkers still inspect the
> subsection map via pfn_section_valid() or pfn_section_first_valid().
> 
> After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
> memory_section->usage") converted the teardown to an RCU-based
> scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
> to readers before ms->usage is cleared and queued for freeing.
> 
> That ordering is not guaranteed. section_deactivate() can clear
> ms->usage and queue kfree_rcu() before another CPU observes the
> SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
> valid_section() return true, enter its sched-RCU read-side critical
> section after kfree_rcu() has already been queued, and then dereference
> a stale ms->usage pointer.
> 
> In addition, pfn_to_online_page() calls pfn_section_valid() without its
> own sched-RCU read-side critical section, which has the same problem.
> 
> The race looks like this:
> 
>   compact_zone()                    memunmap_pages
>   ==============                    ==============
>                                     __remove_pages()->
>                                       sparse_remove_section()->
>                                         section_deactivate():
>                                           a) [ Clear SECTION_HAS_MEM_MAP
>                                                is reordered to b) ]
>                                           kfree_rcu(ms->usage)
>       __pageblock_pfn_to_page
>          ......
>           pfn_valid():
>             rcu_read_lock_sched()
>             valid_section() // return true
>             pfn_section_valid()
>               [Access ms->usage which is UAF]
>                                           WRITE_ONCE(ms->usage, NULL)
>             rcu_read_unlock_sched()       b) Clear SECTION_HAS_MEM_MAP
> 
> Fix this by using rcu_replace_pointer() to clear ms->usage in
> section_deactivate(), so that readers no longer depend on the ordering
> of the SECTION_HAS_MEM_MAP clear.

The fix itself does not look too intrusive and I guess it kind of makes
sense when you think about the ordering issue, so if we want to be
rock solid, why not. 
Does it slow down operations a lot?

I would also point out that you rcu-protect pfn_section_valid().

Regarding the pfn_to_online_page() race, that is something that pops up
every now and then, but as David said, we have never seen it happen in
the wild, so I guess no one has made the time to look into it.

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
  2026-04-15  8:04 ` David Hildenbrand (Arm)
@ 2026-04-15  9:20   ` Muchun Song
  0 siblings, 0 replies; 7+ messages in thread
From: Muchun Song @ 2026-04-15  9:20 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Muchun Song, Andrew Morton, Oscar Salvador, Charan Teja Kalla,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-cxl



> On Apr 15, 2026, at 16:04, David Hildenbrand (Arm) <david@kernel.org> wrote:
> 
> On 4/15/26 04:23, Muchun Song wrote:
>> When memory is hot-removed, section_deactivate() can tear down
>> mem_section->usage while concurrent pfn walkers still inspect the
>> subsection map via pfn_section_valid() or pfn_section_first_valid().
> 
> I'll note that it's all racy either way: someone checking pfn_valid() /
> pfn_to_online_page() can race with concurrent unplug.

Agree. When I first saw the commit message for 5ec8e8ea8b77, I was curious
because the goal of this commit was to fix an access issue with ms->usage.
Looking at the race diagram, I realized that while this only addresses the
->usage access, subsequent accesses to struct page will still be problematic.
It's just that the ->usage issue happened to be the one triggered first
in that report.

> 
> We've known that for years; it's hard to fix; it never ever triggers :)

Glad to know my analysis wasn't off! It seems I've just stumbled upon a
'well-known secret' within the community. :)

> 
> So is this really worth it, when we should in fact, work on protecting
> the users of pfn_valid() / pfn_to_online_page() with rcu or similar?

I am not sure if it is worth fixing, especially since I just realized the
community has been aware of this issue for many years. If we do decide to
fix it, I think the most straightforward approach would be to protect it
using RCU, something like:

# the user side of pfn_to_online_page():
	rcu_read_lock();
	page = pfn_to_online_page(pfn);
	if (!page || !get_page_unless_zero(page))
		goto out_unlock;
	rcu_read_unlock();


# the vmemmap freeing side should free the vmemmap pages via RCU.


Thanks,
Muchun

> 
> -- 
> Cheers,
> 
> David




* Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
  2026-04-15  8:37 ` Oscar Salvador
@ 2026-04-15  9:45   ` Muchun Song
  0 siblings, 0 replies; 7+ messages in thread
From: Muchun Song @ 2026-04-15  9:45 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Muchun Song, Andrew Morton, David Hildenbrand, Charan Teja Kalla,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-cxl



> On Apr 15, 2026, at 16:37, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Wed, Apr 15, 2026 at 10:23:26AM +0800, Muchun Song wrote:
>> When memory is hot-removed, section_deactivate() can tear down
>> mem_section->usage while concurrent pfn walkers still inspect the
>> subsection map via pfn_section_valid() or pfn_section_first_valid().
>> 
>> After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
>> memory_section->usage") converted the teardown to an RCU-based
>> scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
>> to readers before ms->usage is cleared and queued for freeing.
>> 
>> That ordering is not guaranteed. section_deactivate() can clear
>> ms->usage and queue kfree_rcu() before another CPU observes the
>> SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
>> valid_section() return true, enter its sched-RCU read-side critical
>> section after kfree_rcu() has already been queued, and then dereference
>> a stale ms->usage pointer.
>> 
>> In addition, pfn_to_online_page() calls pfn_section_valid() without its
>> own sched-RCU read-side critical section, which has the same problem.
>> 
>> The race looks like this:
>> 
>>  compact_zone()                    memunmap_pages
>>  ==============                    ==============
>>                                    __remove_pages()->
>>                                      sparse_remove_section()->
>>                                        section_deactivate():
>>                                          a) [ Clear SECTION_HAS_MEM_MAP
>>                                               is reordered to b) ]
>>                                          kfree_rcu(ms->usage)
>>      __pageblock_pfn_to_page
>>         ......
>>          pfn_valid():
>>            rcu_read_lock_sched()
>>            valid_section() // return true
>>            pfn_section_valid()
>>              [Access ms->usage which is UAF]
>>                                          WRITE_ONCE(ms->usage, NULL)
>>            rcu_read_unlock_sched()       b) Clear SECTION_HAS_MEM_MAP
>> 
>> Fix this by using rcu_replace_pointer() to clear ms->usage in
>> section_deactivate(), so that readers no longer depend on the ordering
>> of the SECTION_HAS_MEM_MAP clear.
> 
> The fix itself does not look too intrusive and I guess it kind of makes
> sense when you think about the ordering issue, so if we want to be
> rock solid, why not.

You are right. Generally speaking, where RCU is used, the pointer should
be cleared first, and then the memory corresponding to the pointer should
be released via kfree_rcu(). Therefore, the correct sequence of use here
should be:

	usage = ms->usage;
	WRITE_ONCE(ms->usage, NULL);
	kfree_rcu(usage, rcu);

The actual code has the order reversed. To prevent such errors, the RCU
mechanism provides the rcu_replace_pointer() interface. If you look at its
implementation, this interface is essentially equivalent to the code sequence
mentioned above.

> Does it slow down operations a lot?

Regarding section_deactivate, I believe there is no functional difference.

> 
> I would also point out that you rcu-protect pfn_section_valid().

rcu_dereference_sched() is equivalent to the previous READ_ONCE(), but for
RCU-protected resources, it is recommended to use the RCU-specific interfaces
to avoid having to manually account for memory ordering issues.
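
As a userspace analogy of that pairing (hypothetical names;
acquire/release C11 atomics as a conservative model of
rcu_assign_pointer()/rcu_dereference_sched() — a sketch, not the kernel
implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct usage_model {
	unsigned long subsection_map[1];
};

static _Atomic(struct usage_model *) usage_ptr;

/* Model of rcu_dereference_sched(ms->usage): an ordered load that pairs
 * with the publisher's release store below. */
static struct usage_model *usage_deref(void)
{
	return atomic_load_explicit(&usage_ptr, memory_order_acquire);
}

/* Model of rcu_assign_pointer(ms->usage, p): release-publish, so any
 * reader that sees p also sees p's initialized contents. */
static void usage_publish(struct usage_model *p)
{
	atomic_store_explicit(&usage_ptr, p, memory_order_release);
}
```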

> 
> Regarding the pfn_to_online_page() race, that is something that every now
> and then pops up, but as David said, we never seen that happening in the
> wild so I guess no one really made the time to look into that.

After taking a closer look at commit 5ec8e8ea8b77, it’s clear that it was
intended to resolve a race between __pageblock_pfn_to_page and section_deactivate.
While the issue surfaced this time due to ms->usage, I’m concerned that even
with ms->usage fixed, the underlying race condition in the execution path
still exists. This suggests that subsequent accesses to struct page might
run into similar trouble down the road.

That said, I’d love to hear what everyone thinks about whether this warrants a
full fix.

Thanks,
Muchun.

> 
> 
> 
> -- 
> Oscar Salvador
> SUSE Labs




end of thread, other threads:[~2026-04-15  9:46 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-15  2:23 [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers Muchun Song
2026-04-15  5:44 ` Andrew Morton
2026-04-15  6:06   ` Muchun Song
2026-04-15  8:04 ` David Hildenbrand (Arm)
2026-04-15  9:20   ` Muchun Song
2026-04-15  8:37 ` Oscar Salvador
2026-04-15  9:45   ` Muchun Song
