* [PATCH V2] mm: swap: check for stable address space before operating on the VMA
@ 2025-09-24 18:11 Charan Teja Kalla
2025-09-24 18:15 ` David Hildenbrand
2025-09-25 17:35 ` Chris Li
0 siblings, 2 replies; 5+ messages in thread
From: Charan Teja Kalla @ 2025-09-24 18:11 UTC (permalink / raw)
To: david, Liam.Howlett, lorenzo.stoakes, akpm, shikemeng, kasong,
nphamcs, bhe, baohua, chrisl, zhangpeng.00
Cc: linux-mm, linux-kernel, Charan Teja Kalla
It is possible to hit a zero entry while traversing the vmas in
unuse_mm() called from swapoff path and accessing it causes the
OOPS:
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000446--> Loading the memory from offset 0x40 on the
XA_ZERO_ENTRY as address.
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
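[Editor's note] The faulting address is consistent with the explanation above. XA_ZERO_ENTRY is xa_mk_internal(257), and xa_mk_internal(v) encodes v as ((v << 2) | 2), giving the non-NULL pointer value 0x406; adding the reported 0x40 load offset yields exactly 0x446. A quick userspace re-statement of that arithmetic (illustrative only, not kernel code; the 0x40 offset is taken from the report, not derived here):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace re-statement of the xarray internal-entry encoding:
 * xa_mk_internal(v) == (void *)((v << 2) | 2); XA_ZERO_ENTRY uses
 * v = 257. Illustrative arithmetic only. */
static uintptr_t xa_mk_internal_val(uintptr_t v)
{
	return (v << 2) | 2;
}
```

Since 0x406 is non-NULL, a plain NULL check on the iterator result cannot catch the sentinel, and the load from offset 0x40 lands at 0x406 + 0x40 = 0x446.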
The issue manifests from the below race between fork() on a
process and swapoff:
fork(dup_mmap()) swapoff(unuse_mm)
--------------- -----------------
1) Identical mtree is built using
__mt_dup().
2) copy_pte_range()-->
copy_nonpresent_pte():
The dst mm is added into the
mmlist to be visible to the
swapoff operation.
3) Fatal signal is sent to the parent
process(which is the current during the
fork) thus skip the duplication of the
vmas and mark the vma range with
XA_ZERO_ENTRY as a marker for this process
that helps during exit_mmap().
4) swapoff is tried on the
'mm' added to the 'mmlist' as
part of the 2.
5) unuse_mm(), which iterates
through the vma's of this 'mm',
hits the non-NULL zero entry,
and operating on this zero
entry as a vma results in the
oops.
The proper fix is to avoid exposing this partially-valid tree to
others when dropping the mmap lock, which is being solved with [1]. A
simpler solution is to check for MMF_UNSTABLE, as it is set if the
mm_struct is not fully initialized in dup_mmap().
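[Editor's note] The fixed flow can be sketched in userspace. In the sketch below, mm_sim, vma_sim, and the flag/error constants are stand-ins, not kernel types; the point it demonstrates is that the leftover zero entry is a non-NULL pointer value, so only an MMF_UNSTABLE-style check before the walk keeps the iteration from treating the sentinel as a vma:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins, not kernel code: a toy mm with an MMF_UNSTABLE-like flag
 * and a NULL-terminated "tree" of vma pointers. */
#define MMF_UNSTABLE_SIM  0x1UL
#define EFAULT_SIM        14

struct vma_sim { void *anon_vma; };

struct mm_sim {
	unsigned long flags;
	struct vma_sim *vmas[4];
};

/* Analogue of check_stable_address_space(): reject a not-fully-set-up mm. */
static int check_stable_sim(const struct mm_sim *mm)
{
	return (mm->flags & MMF_UNSTABLE_SIM) ? -EFAULT_SIM : 0;
}

/* unuse_mm()-style walk with the proposed guard: bail out before touching
 * any entry, so a zero-entry sentinel in vmas[] is never dereferenced. */
static int unuse_mm_sim(struct mm_sim *mm)
{
	if (check_stable_sim(mm))
		return -EFAULT_SIM;
	for (size_t i = 0; mm->vmas[i]; i++) {
		struct vma_sim *vma = mm->vmas[i];
		if (vma->anon_vma)      /* would fault on a sentinel pointer */
			return 1;
	}
	return 0;
}
```

With the flag set, the walk returns early instead of dereferencing the sentinel; that is the behavior the patch below adds via check_stable_address_space().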
Thanks to Liam/Lorenzo/David for all the suggestions in fixing this
issue.
[1] https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
---
V1:
-- Checked for xa_zero_entry() instead of the cleaner approach of
checking for MMF_UNSTABLE
-- https://lore.kernel.org/linux-mm/20250808092156.1918973-1-quic_charante@quicinc.com/
mm/swapfile.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 890b410d77b6..10760240a3a2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2389,6 +2389,8 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
VMA_ITERATOR(vmi, mm, 0);
mmap_read_lock(mm);
+ if (check_stable_address_space(mm))
+ goto unlock;
for_each_vma(vmi, vma) {
if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
ret = unuse_vma(vma, type);
@@ -2398,6 +2400,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
cond_resched();
}
+unlock:
mmap_read_unlock(mm);
return ret;
}
--
2.34.1
* Re: [PATCH V2] mm: swap: check for stable address space before operating on the VMA
2025-09-24 18:11 [PATCH V2] mm: swap: check for stable address space before operating on the VMA Charan Teja Kalla
@ 2025-09-24 18:15 ` David Hildenbrand
2025-09-25 17:35 ` Chris Li
1 sibling, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2025-09-24 18:15 UTC (permalink / raw)
To: Charan Teja Kalla, Liam.Howlett, lorenzo.stoakes, akpm,
shikemeng, kasong, nphamcs, bhe, baohua, chrisl, zhangpeng.00
Cc: linux-mm, linux-kernel
On 24.09.25 20:11, Charan Teja Kalla wrote:
> It is possible to hit a zero entry while traversing the vmas in
> unuse_mm() called from swapoff path and accessing it causes the
> OOPS:
>
> Unable to handle kernel NULL pointer dereference at virtual address
> 0000000000000446--> Loading the memory from offset 0x40 on the
> XA_ZERO_ENTRY as address.
> Mem abort info:
> ESR = 0x0000000096000005
> EC = 0x25: DABT (current EL), IL = 32 bits
> SET = 0, FnV = 0
> EA = 0, S1PTW = 0
> FSC = 0x05: level 1 translation fault
>
> The issue manifests from the below race between fork() on a
> process and swapoff:
> fork(dup_mmap()) swapoff(unuse_mm)
> --------------- -----------------
> 1) Identical mtree is built using
> __mt_dup().
>
> 2) copy_pte_range()-->
> copy_nonpresent_pte():
> The dst mm is added into the
> mmlist to be visible to the
> swapoff operation.
>
> 3) Fatal signal is sent to the parent
> process(which is the current during the
> fork) thus skip the duplication of the
> vmas and mark the vma range with
> XA_ZERO_ENTRY as a marker for this process
> that helps during exit_mmap().
>
> 4) swapoff is tried on the
> 'mm' added to the 'mmlist' as
> part of the 2.
>
> 5) unuse_mm(), which iterates
> through the vma's of this 'mm',
> hits the non-NULL zero entry,
> and operating on this zero
> entry as a vma results in the
> oops.
>
> The proper fix is to avoid exposing this partially-valid tree to
> others when dropping the mmap lock, which is being solved with [1]. A
> simpler solution is to check for MMF_UNSTABLE, as it is set if the
> mm_struct is not fully initialized in dup_mmap().
>
> Thanks to Liam/Lorenzo/David for all the suggestions in fixing this
> issue.
>
> [1] https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/
>
> Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
I assume we want to CC stable?
> ---
>
> V1:
> -- Checked for xa_zero_entry() instead of the cleaner approach of
> checking for MMF_UNSTABLE
> -- https://lore.kernel.org/linux-mm/20250808092156.1918973-1-quic_charante@quicinc.com/
>
> mm/swapfile.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 890b410d77b6..10760240a3a2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2389,6 +2389,8 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
> VMA_ITERATOR(vmi, mm, 0);
>
> mmap_read_lock(mm);
> + if (check_stable_address_space(mm))
> + goto unlock;
> for_each_vma(vmi, vma) {
> if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
> ret = unuse_vma(vma, type);
> @@ -2398,6 +2400,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
>
> cond_resched();
> }
> +unlock:
> mmap_read_unlock(mm);
> return ret;
> }
Yeah, this should do until Liam's rework is in.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH V2] mm: swap: check for stable address space before operating on the VMA
2025-09-24 18:11 [PATCH V2] mm: swap: check for stable address space before operating on the VMA Charan Teja Kalla
2025-09-24 18:15 ` David Hildenbrand
@ 2025-09-25 17:35 ` Chris Li
2025-09-26 8:30 ` Barry Song
1 sibling, 1 reply; 5+ messages in thread
From: Chris Li @ 2025-09-25 17:35 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: david, Liam.Howlett, lorenzo.stoakes, akpm, shikemeng, kasong,
nphamcs, bhe, baohua, zhangpeng.00, linux-mm, linux-kernel
On Wed, Sep 24, 2025 at 11:12 AM Charan Teja Kalla
<charan.kalla@oss.qualcomm.com> wrote:
>
> It is possible to hit a zero entry while traversing the vmas in
> unuse_mm() called from swapoff path and accessing it causes the
> OOPS:
>
> Unable to handle kernel NULL pointer dereference at virtual address
> 0000000000000446--> Loading the memory from offset 0x40 on the
> XA_ZERO_ENTRY as address.
> Mem abort info:
> ESR = 0x0000000096000005
> EC = 0x25: DABT (current EL), IL = 32 bits
> SET = 0, FnV = 0
> EA = 0, S1PTW = 0
> FSC = 0x05: level 1 translation fault
>
> The issue manifests from the below race between fork() on a
> process and swapoff:
> fork(dup_mmap()) swapoff(unuse_mm)
> --------------- -----------------
> 1) Identical mtree is built using
> __mt_dup().
>
> 2) copy_pte_range()-->
> copy_nonpresent_pte():
> The dst mm is added into the
> mmlist to be visible to the
> swapoff operation.
>
> 3) Fatal signal is sent to the parent
> process(which is the current during the
> fork) thus skip the duplication of the
> vmas and mark the vma range with
> XA_ZERO_ENTRY as a marker for this process
> that helps during exit_mmap().
>
> 4) swapoff is tried on the
> 'mm' added to the 'mmlist' as
> part of the 2.
>
> 5) unuse_mm(), which iterates
> through the vma's of this 'mm',
> hits the non-NULL zero entry,
> and operating on this zero
> entry as a vma results in the
> oops.
>
> The proper fix is to avoid exposing this partially-valid tree to
> others when dropping the mmap lock, which is being solved with [1]. A
> simpler solution is to check for MMF_UNSTABLE, as it is set if the
> mm_struct is not fully initialized in dup_mmap().
>
> Thanks to Liam/Lorenzo/David for all the suggestions in fixing this
> issue.
>
> [1] https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/
>
> Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
> ---
>
> V1:
> -- Checked for xa_zero_entry() instead of the cleaner approach of
> checking for MMF_UNSTABLE
> -- https://lore.kernel.org/linux-mm/20250808092156.1918973-1-quic_charante@quicinc.com/
>
> mm/swapfile.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 890b410d77b6..10760240a3a2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2389,6 +2389,8 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
> VMA_ITERATOR(vmi, mm, 0);
>
> mmap_read_lock(mm);
> + if (check_stable_address_space(mm))
> + goto unlock;
This is checking the MMF_UNSTABLE bit in the mm flags.
What is the locking requirement for accessing the mm flags MMF_UNSTABLE bit?
Here we hold the mm mmap read lock.
As far as I can tell, there are two paths that can set that bit.
1) dup_mm()
It holds the mm mmap write lock. This path is fine due to the write lock.
So far the above race against dup_mm(), adding this check is fine.
2) __oom_reap_task_mm()
It holds the mmap read lock when setting MMF_UNSTABLE, as far as I can tell.
So checking MMF_UNSTABLE here and setting it in a concurrent
__oom_reap_task_mm() do not exclude each other.
This is more of a question for oom reaping.
Is there a test vs. set race on MMF_UNSTABLE here? It seems this
check does not protect against __oom_reap_task_mm(). I have no idea
whether this race is triggerable; I just want someone else to double-check
whether my understanding is correct.
I can see this patch does protect against the intended race between
dup_mm() and unuse_mm(), so it adds value.
Chris
> for_each_vma(vmi, vma) {
> if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
> ret = unuse_vma(vma, type);
> @@ -2398,6 +2400,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
>
> cond_resched();
> }
> +unlock:
> mmap_read_unlock(mm);
> return ret;
> }
> --
> 2.34.1
>
* Re: [PATCH V2] mm: swap: check for stable address space before operating on the VMA
2025-09-25 17:35 ` Chris Li
@ 2025-09-26 8:30 ` Barry Song
2025-09-28 16:01 ` Charan Teja Kalla
0 siblings, 1 reply; 5+ messages in thread
From: Barry Song @ 2025-09-26 8:30 UTC (permalink / raw)
To: Chris Li
Cc: Charan Teja Kalla, david, Liam.Howlett, lorenzo.stoakes, akpm,
shikemeng, kasong, nphamcs, bhe, zhangpeng.00, linux-mm,
linux-kernel
On Fri, Sep 26, 2025 at 9:03 AM Chris Li <chrisl@kernel.org> wrote:
[...]
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -2389,6 +2389,8 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
> > VMA_ITERATOR(vmi, mm, 0);
> >
> > mmap_read_lock(mm);
> > + if (check_stable_address_space(mm))
> > + goto unlock;
>
> This is checking the MMF_UNSTABLE bit in the mm flags.
> What is the locking requirement for accessing the mm flags MMF_UNSTABLE bit?
> Here we hold the mm mmap read lock.
>
> As far as I can tell, there are two paths that can set that bit.
>
> 1) dup_mm()
> It holds the mm mmap write lock. This path is fine due to the write lock.
> So far the above race against dup_mm(), adding this check is fine.
>
> 2) __oom_reap_task_mm()
> It holds the mmap read lock when setting the MMF_UNSTABLE as far as I can tell.
> So checking the MMF_UNSTABLE with another __oom_reap_task_mm() does
> not exclude each other.
> This is more of a question for oom reaping.
> Does MMF_UNSTABLE have the test vs set racing here? It seems this
> check does not protect against __oom_reap_task_mm(). I have no idea
> if this race is triggerable. Just want someone else to double check if
> my understanding is correct or not.
I haven’t actually run the code.
My guess is there’s a race when checking MMF_UNSTABLE against the
OOM reaper. I think it’s fine either way—whether we skip an OOM-reaped
mm upfront or take a middle path—since the OOM reaper will handle those
PTEs with the PTL just like unuse_pte() does and eventually free the mm
of the reaped process. It’s probably better to skip it early and avoid
unnecessary unuse_pte() calls.
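[Editor's note] The check-then-act concern can be made concrete. MMF_UNSTABLE is set with set_bit() and read with test_bit(); each operation is individually atomic, so no lock is needed for the flag word itself, but a reader that tests just before a concurrent setter flips the bit will still see it clear. A userspace analogue of the two operations (stand-in names, not the kernel bitops):

```c
#include <assert.h>
#include <stdatomic.h>

#define MMF_UNSTABLE_SIM (1u << 4)

static atomic_uint mm_flags_sim;

/* set_bit()-style: an atomic read-modify-write; safe without the mmap
 * write lock as far as the flag word itself is concerned. */
static void set_unstable_sim(void)
{
	atomic_fetch_or(&mm_flags_sim, MMF_UNSTABLE_SIM);
}

/* test_bit()-style: a plain atomic load. A test that completes before a
 * concurrent set_unstable_sim() legitimately returns 0; the check by
 * itself provides no mutual exclusion against the setter. */
static int test_unstable_sim(void)
{
	return (atomic_load(&mm_flags_sim) & MMF_UNSTABLE_SIM) != 0;
}
```

So the check can miss an in-flight OOM reap, which is harmless here for the reasons above: the worst case is some unnecessary unuse_pte() work, not a correctness problem.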
>
> I can see this patch does protect the intended race in dup_mm() vs
> unuse_mm(), it adds value.
This also seems to add value for OOM-reaped processes by avoiding a
useless unuse(), in case we aren't skipping this mm right now. I'm
not sure whether we've been skipping OOM-reaped processes elsewhere.
Hi Charan, do you have any observations on this? If an additional value is
added, could we record it in the changelog? Otherwise, can we add some
description in the changelog to address Chris’ concern?
Thanks
Barry
* Re: [PATCH V2] mm: swap: check for stable address space before operating on the VMA
2025-09-26 8:30 ` Barry Song
@ 2025-09-28 16:01 ` Charan Teja Kalla
0 siblings, 0 replies; 5+ messages in thread
From: Charan Teja Kalla @ 2025-09-28 16:01 UTC (permalink / raw)
To: Barry Song, Chris Li
Cc: david, Liam.Howlett, lorenzo.stoakes, akpm, shikemeng, kasong,
nphamcs, bhe, zhangpeng.00, linux-mm, linux-kernel
Thanks Barry/Chris for checking it.
On 9/26/2025 2:00 PM, Barry Song wrote:
>> 2) __oom_reap_task_mm()
>> It holds the mmap read lock when setting the MMF_UNSTABLE as far as I can tell.
>> So checking the MMF_UNSTABLE with another __oom_reap_task_mm() does
>> not exclude each other.
>> This is more of a question for oom reaping.
>> Does MMF_UNSTABLE have the test vs set racing here? It seems this
>> check does not protect against __oom_reap_task_mm(). I have no idea
>> if this race is triggerable. Just want someone else to double check if
>> my understanding is correct or not.
> I haven’t actually run the code.
> My guess is there’s a race when checking MMF_UNSTABLE against the
> OOM reaper. I think it’s fine either way—whether we skip an OOM-reaped
> mm upfront or take a middle path—since the OOM reaper will handle those
> PTEs with the PTL just like unuse_pte() does and eventually free the mm
> of the reaped process. It’s probably better to skip it early and avoid
> unnecessary unuse_pte() calls.
>
I am sorry, but I can't really see any stability issue between the oom
reaper and unuse_mm(); there are just the unnecessary unuse_pte() calls
after reaping, as Barry mentioned.
>> I can see this patch does protect the intended race in dup_mm() vs
>> unuse_mm(), it adds value.
> This also seems to add values for OOM-reaped processes to avoid a
> useless unuse(), in case we aren’t skipping this mm right now. I’m
> not sure if we’ve been skipping OOM-reaped processes elsewhere.
>
I don't see any explicit flag that tells whether the process is already
oom-reaped or is under reaping. There is MMF_OOM_REAP_QUEUED, but it
doesn't tell whether the mm has already been reaped.
If the unnecessary calls to unuse_vma() are really a concern, then
checking MMF_UNSTABLE while traversing the VMAs may be a solution, but
that looks ugly.
> Hi Charan, do you have any observations on this? If an additional value is
> added, could we record it in the changelog? Otherwise, can we add some
> description in the changelog to address Chris’ concern?
I think Chris's question can go as a completely different change, as the
mentioned problem existed even before this change; please correct me if
I'm wrong.
Thanks,
Charan