[PATCH] mm/page_owner: fix prematurely released rcu_read_lock()
From: ranxiaokai627 @ 2025-12-23  9:25 UTC
To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, david, luizcap
Cc: linux-kernel, linux-mm, ran.xiaokai, ranxiaokai627
From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
In CONFIG_SPARSEMEM systems, page_ext uses RCU to synchronize with
memory hotplug operations, ensuring that page_ext memory is not freed
by a MEM_OFFLINE event while page_ext data is being accessed.

Since page_owner data is part of page_ext, rcu_read_lock() must be held
across the entire page_owner access and must not be released midway;
otherwise, a use-after-free can occur. The racy sequence looks like
this:
CPU0                                          CPU1
__folio_copy_owner():                         MEM_OFFLINE:
  page_ext = page_ext_get(&old->page);
  old_page_owner = ...
  page_ext_put(page_ext);

  page_ext = page_ext_get(&newfolio->page);
  new_page_owner = ...
  page_ext_put(page_ext);
                                                __invalidate_page_ext(pfn);
                                                synchronize_rcu();
                                                __free_page_ext(pfn);
  old_page_owner->pid
  new_page_owner->order   ---> access to freed area
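
For background, page_ext_get()/page_ext_put() on SPARSEMEM look roughly
like this (simplified from mm/page_ext.c; kerneldoc and invalid-entry
handling trimmed). The RCU read section is entered in get and only left
in put:

  struct page_ext *page_ext_get(const struct page *page)
  {
	struct page_ext *page_ext;

	/* Enter the RCU read section that the MEM_OFFLINE path waits on. */
	rcu_read_lock();
	page_ext = lookup_page_ext(page);
	if (!page_ext) {
		rcu_read_unlock();
		return NULL;
	}

	/* Caller remains in the RCU read section. */
	return page_ext;
  }

  void page_ext_put(struct page_ext *page_ext)
  {
	if (unlikely(!page_ext))
		return;

	/* After this, the page_ext memory may be freed at any time. */
	rcu_read_unlock();
  }

So a page_owner pointer derived from a page_ext must not be
dereferenced after the matching page_ext_put().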
Fixes: 3a812bed3d32a ("mm: page_owner: use new iteration API")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
---
mm/page_owner.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index b6a394a130ec..5d6860e54be7 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -375,24 +375,25 @@ void __split_page_owner(struct page *page, int old_order, int new_order)
 void __folio_copy_owner(struct folio *newfolio, struct folio *old)
 {
 	struct page_ext *page_ext;
+	struct page_ext *old_page_ext, *new_page_ext;
 	struct page_ext_iter iter;
 	struct page_owner *old_page_owner;
 	struct page_owner *new_page_owner;
 	depot_stack_handle_t migrate_handle;
 
-	page_ext = page_ext_get(&old->page);
-	if (unlikely(!page_ext))
+	old_page_ext = page_ext_get(&old->page);
+	if (unlikely(!old_page_ext))
 		return;
 
-	old_page_owner = get_page_owner(page_ext);
-	page_ext_put(page_ext);
+	old_page_owner = get_page_owner(old_page_ext);
 
-	page_ext = page_ext_get(&newfolio->page);
-	if (unlikely(!page_ext))
+	new_page_ext = page_ext_get(&newfolio->page);
+	if (unlikely(!new_page_ext)) {
+		page_ext_put(old_page_ext);
 		return;
+	}
 
-	new_page_owner = get_page_owner(page_ext);
-	page_ext_put(page_ext);
+	new_page_owner = get_page_owner(new_page_ext);
 
 	migrate_handle = new_page_owner->handle;
 	__update_page_owner_handle(&newfolio->page, old_page_owner->handle,
@@ -414,12 +415,12 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
 	 * for the new one and the old folio otherwise there will be an imbalance
 	 * when subtracting those pages from the stack.
 	 */
-	rcu_read_lock();
 	for_each_page_ext(&old->page, 1 << new_page_owner->order, page_ext, iter) {
 		old_page_owner = get_page_owner(page_ext);
 		old_page_owner->handle = migrate_handle;
 	}
-	rcu_read_unlock();
+	page_ext_put(new_page_ext);
+	page_ext_put(old_page_ext);
 }
 
 void pagetypeinfo_showmixedcount_print(struct seq_file *m,
--
2.25.1

Re: [PATCH] mm/page_owner: fix prematurely released rcu_read_lock()
From: David Hildenbrand (Red Hat) @ 2025-12-23  9:42 UTC
To: ranxiaokai627, akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, luizcap
Cc: linux-kernel, linux-mm, ran.xiaokai

On 12/23/25 10:25, ranxiaokai627@163.com wrote:
> From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>
> In CONFIG_SPARSEMEM systems, page_ext uses RCU to synchronize with
> memory hotplug operations, ensuring that page_ext memory is not freed
> by a MEM_OFFLINE event while page_ext data is being accessed.
[...]
> @@ -414,12 +415,12 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
>  	 * for the new one and the old folio otherwise there will be an imbalance
>  	 * when subtracting those pages from the stack.
>  	 */
> -	rcu_read_lock();
>  	for_each_page_ext(&old->page, 1 << new_page_owner->order, page_ext, iter) {
>  		old_page_owner = get_page_owner(page_ext);
>  		old_page_owner->handle = migrate_handle;
>  	}
> -	rcu_read_unlock();
> +	page_ext_put(new_page_ext);
> +	page_ext_put(old_page_ext);
>  }

How are you possibly able to call into __split_page_owner() while
concurrently we are already finished with offlining the memory (-> all
memory freed and isolated in the buddy) and triggering the notifier?

Doesn't make sense, no?

-- 
Cheers

David

Re: [PATCH] mm/page_owner: fix prematurely released rcu_read_lock()
From: ranxiaokai627 @ 2025-12-25  8:17 UTC
To: david
Cc: akpm, hannes, jackmanb, linux-mm, luizcap, mhocko, ran.xiaokai,
    ranxiaokai627, surenb, vbabka, ziy

Hi, David

> On 12/23/25 10:25, ranxiaokai627@163.com wrote:
>> From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
[...]
> How are you possibly able to call into __split_page_owner() while
> concurrently we are already finished with offlining the memory (-> all
> memory freed and isolated in the buddy) and triggering the notifier?

This patch does not involve __split_page_owner(); did you perhaps mean
__folio_copy_owner()?

You are right: when memory_notify(MEM_OFFLINE, &mem_arg) is called,
memory hot-remove has already completed. At that stage, all memory in
the mem_section has been fully freed and removed from zone->free_area[],
so those pages can no longer be allocated.
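
For reference, the offline side is roughly the following three-step
sequence (my simplified paraphrase of the mm/page_ext.c offline path;
section-alignment details trimmed):

  static void offline_page_ext(unsigned long start_pfn,
			       unsigned long nr_pages)
  {
	unsigned long start, end, pfn;

	start = SECTION_ALIGN_DOWN(start_pfn);
	end = SECTION_ALIGN_UP(start_pfn + nr_pages);

	/* 1) Make lookup_page_ext() fail for these sections. */
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		__invalidate_page_ext(pfn);

	/* 2) Wait for readers that entered before step 1 to finish. */
	synchronize_rcu();

	/* 3) No reader can hold a reference now; free the memory. */
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		__free_page_ext(pfn);
  }

A reader that drops the RCU lock between two accesses can fall into the
window between steps 2 and 3, which is exactly the race described in
the patch.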
Currently, only read_page_owner() and pagetypeinfo_showmixedcount_print()
genuinely require the RCU lock to handle concurrency with MEM_OFFLINE
events, because while traversing page frames across the system/zone we
cannot know in advance the hotplug/remove state of those PFNs. In the
other functions, the RCU lock is merely taken to satisfy the assertion
WARN_ON_ONCE(!rcu_read_lock_held()) within lookup_page_ext(), right?

Regarding page_ext, I have a question: semantically, page_ext should
correspond to one structure per folio, not one per 4K base page. On
x86_64, page_ext consumes 88 bytes, more memory than struct page
itself. As mTHP adoption grows, avoiding page_owner metadata
setup/teardown for tail pages would yield performance gains.

Moreover, with folios gradually replacing struct page, will struct page
eventually be superseded by all kinds of dynamically allocated
memdesc_xxx_t structures? If so, wouldn't it be more reasonable to
dynamically allocate a corresponding page_ext structure after folio
allocation? While this would add overhead on the allocation hot path, I
believe it could be optimized, perhaps by allocating several page
frames at once to hold page_ext structures, or by establishing a
dedicated kmem_cache for page_ext. What are your suggestions regarding
this approach of dynamically allocating page_ext structures?
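
To make the kmem_cache idea concrete, a purely hypothetical sketch
(page_ext_cachep and folio_alloc_page_ext() are invented names here;
page_ext_size is the existing total size that already accounts for all
enabled extensions):

  static struct kmem_cache *page_ext_cachep;

  static int __init page_ext_cache_init(void)
  {
	/* One object per folio instead of one per 4K base page. */
	page_ext_cachep = kmem_cache_create("page_ext", page_ext_size,
					    0, SLAB_PANIC, NULL);
	return 0;
  }

  /* Hypothetical hook: attach a page_ext when a folio is allocated. */
  static struct page_ext *folio_alloc_page_ext(struct folio *folio,
					       gfp_t gfp)
  {
	return kmem_cache_zalloc(page_ext_cachep, gfp);
  }

This only illustrates the shape of the proposal; lookup and lifetime
handling are left out.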

Re: [PATCH] mm/page_owner: fix prematurely released rcu_read_lock()
From: David Hildenbrand (Red Hat) @ 2025-12-30 20:45 UTC
To: ranxiaokai627
Cc: akpm, hannes, jackmanb, linux-mm, luizcap, mhocko, ran.xiaokai,
    surenb, vbabka, ziy

On 12/25/25 09:17, ranxiaokai627@163.com wrote:
> Hi, David
[...]
> This patch does not involve __split_page_owner(); did you perhaps mean
> __folio_copy_owner()?

Yes, any kind of folio operation. Once memory is offline, any operation
on the page/folio itself would already be fatal. So if we would ever
reach this code while memory is already offline, we would be having a
bad time even without any page_ext magic.

[...]
> In the other functions, the RCU lock is merely taken to satisfy the
> assertion WARN_ON_ONCE(!rcu_read_lock_held()) within lookup_page_ext(),
> right?

Exactly.

> Regarding page_ext, I have a question: semantically, page_ext should
> correspond to one structure per folio, not one per 4K base page.
[...]
> If so, wouldn't it be more reasonable to dynamically allocate a
> corresponding page_ext structure after folio allocation?

Most (but not all ...) page_ext data will be per folio, yes. Some is
per allocation (which might not be a folio in the future ...), which is
where it gets tricky.

But yes, Willy already thought about simply placing some of the
page_ext stuff into "struct folio" once we dynamically allocate
"struct folio". The question is how we will handle this for other
allocations that will not be folios (e.g., where to store page owner).

-- 
Cheers

David