linux-mm.kvack.org archive mirror
* [PATCH 0/2] mm/memory: align alloc_swap_folio() logic with alloc_anon_folio()
@ 2025-12-16  7:59 Wei Yang
  2025-12-16  7:59 ` [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists Wei Yang
  2025-12-16  7:59 ` [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan Wei Yang
  0 siblings, 2 replies; 9+ messages in thread
From: Wei Yang @ 2025-12-16  7:59 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko
  Cc: linux-mm, hanchuanhua, v-songbaohua, Wei Yang

This series cleans up the logic used when swapping in large folios to bring it
in line with the established patterns in alloc_anon_folio().

Context:

When swapping in a large folio, the kernel performs a process similar to
alloc_anon_folio(): it scans the PTE range to determine the highest suitable
order for a large folio and then attempts allocation. However, two key
discrepancies currently exist in alloc_swap_folio():

  1. Improper Fallback: It currently falls back to an order-0 folio if
     pte_offset_map_lock() returns NULL.
  2. Unnecessary Locking: It locks the page table during the initial scanning
     phase.

Improvements:

Based on an analysis of the alloc_anon_folio() implementation, this series
applies the following corrections to the swap-in path (a simplified sketch of
the alloc_anon_folio() pattern follows the list):

  1. Bail out on NULL PTE: If pte_offset_map_lock() returns NULL, it indicates
     the page table has changed under us. In this scenario, the operation is
     likely to fail later anyway; continuing the process is inefficient and
     unnecessary.

  2. Lockless Initial Scan: The initial scan only reads PTEs to determine the
     appropriate folio order. Since no modifications are made at this stage,
     and the table is re-verified and locked before the actual write, the
     initial lock is redundant. Removing it reduces lock contention and system
     load.
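
For reference, the pattern in alloc_anon_folio() that this series follows
looks roughly like this (simplified sketch based on my reading of current
mainline; helper names and surrounding context may differ on other trees):

        pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
        if (!pte)
                return ERR_PTR(-EAGAIN);        /* no PTE table: let the fault retry */

        /* lockless scan: highest order whose aligned range is all pte_none() */
        order = highest_order(orders);
        while (orders) {
                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
                if (pte_range_none(pte + pte_index(addr), 1 << order))
                        break;
                order = next_order(&orders, order);
        }
        pte_unmap(pte);

with the caller, do_anonymous_page(), treating the error pointer as "retry
the fault":

        folio = alloc_anon_folio(vmf);
        if (IS_ERR(folio))
                return 0;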

Wei Yang (2):
  mm: bail out do_swap_page() when no PTE table exists
  mm: avoid unnecessary PTE table lock during initial swap folio scan

 mm/memory.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

-- 
2.34.1




* [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists
  2025-12-16  7:59 [PATCH 0/2] mm/memory: align alloc_swap_folio() logic with alloc_anon_folio() Wei Yang
@ 2025-12-16  7:59 ` Wei Yang
  2025-12-19  8:42   ` David Hildenbrand (Red Hat)
  2025-12-16  7:59 ` [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan Wei Yang
  1 sibling, 1 reply; 9+ messages in thread
From: Wei Yang @ 2025-12-16  7:59 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko
  Cc: linux-mm, hanchuanhua, v-songbaohua, Wei Yang

The alloc_swap_folio() function scans the PTE table to determine the
potential size (order) of the folio content to be swapped in.

Currently, if the call to pte_offset_map_lock() returns NULL, it
indicates that no PTE table exists for that range. Despite this, the
code proceeds to allocate an order-0 folio and continues the swap-in
process. This is unnecessary if the required table is absent.

This commit modifies the logic to immediately bail out of the swap-in
process when the PTE table is missing (i.e., pte_offset_map_lock()
returns NULL). This ensures we do not attempt to continue swapping when
the page table structure is incomplete or changed, preventing
unnecessary work.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..1b8ef4f0ea60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4566,7 +4566,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
 				  vmf->address & PMD_MASK, &ptl);
 	if (unlikely(!pte))
-		goto fallback;
+		return ERR_PTR(-EAGAIN);
 
 	/*
 	 * For do_swap_page, find the highest order where the aligned range is
@@ -4709,6 +4709,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
 			folio = alloc_swap_folio(vmf);
+			if (IS_ERR(folio))
+				goto out;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
-- 
2.34.1




* [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan
  2025-12-16  7:59 [PATCH 0/2] mm/memory: align alloc_swap_folio() logic with alloc_anon_folio() Wei Yang
  2025-12-16  7:59 ` [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists Wei Yang
@ 2025-12-16  7:59 ` Wei Yang
  2025-12-19  8:47   ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 9+ messages in thread
From: Wei Yang @ 2025-12-16  7:59 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko
  Cc: linux-mm, hanchuanhua, v-songbaohua, Wei Yang

The alloc_swap_folio() function performs an initial scan of the PTE
table solely to determine the potential size (order) of the folio
content that needs to be swapped in.

Locking the PTE table during this initial read is unnecessary for two
reasons:

  * We are not writing to the PTE table at this stage.

  * The code will re-check and lock the table again immediately before
    any actual modification is attempted.

This commit refactors the initial scan to map the PTE table without
acquiring the lock. This reduces contention and overhead, improving
performance of the swap-in path.
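
For clarity, the resulting flow is roughly the following (simplified sketch;
the later re-check lives in do_swap_page() and is paraphrased here, not
quoted verbatim):

        /* alloc_swap_folio(): read-only scan, no PTL held */
        pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
        /* ... walk the swap entries, pick the highest suitable order ... */
        pte_unmap(pte);

        /* ... allocate the folio, which may sleep ... */

        /* do_swap_page(): re-map and lock before any PTE is modified */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;
        /* for large folios, the whole range is additionally re-validated
         * under the PTL before the PTEs are installed */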

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1b8ef4f0ea60..f8d6adfa83d7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4529,7 +4529,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	struct folio *folio;
 	unsigned long addr;
 	softleaf_t entry;
-	spinlock_t *ptl;
 	pte_t *pte;
 	gfp_t gfp;
 	int order;
@@ -4563,8 +4562,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	if (!orders)
 		goto fallback;
 
-	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
-				  vmf->address & PMD_MASK, &ptl);
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
 	if (unlikely(!pte))
 		return ERR_PTR(-EAGAIN);
 
@@ -4580,7 +4578,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 		order = next_order(&orders, order);
 	}
 
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap(pte);
 
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
-- 
2.34.1




* Re: [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists
  2025-12-16  7:59 ` [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists Wei Yang
@ 2025-12-19  8:42   ` David Hildenbrand (Red Hat)
  2025-12-20  3:24     ` Wei Yang
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19  8:42 UTC (permalink / raw)
  To: Wei Yang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko
  Cc: linux-mm, hanchuanhua, v-songbaohua

On 12/16/25 08:59, Wei Yang wrote:
> The alloc_swap_folio() function scans the PTE table to determine the
> potential size (order) of the folio content to be swapped in.
> 
> Currently, if the call to pte_offset_map_lock() returns NULL, it
> indicates that no PTE table exists for that range. Despite this, the
> code proceeds to allocate an order-0 folio and continues the swap-in
> process. This is unnecessary if the required table is absent.
> 
> This commit modifies the logic to immediately bail out of the swap-in
> process when the PTE table is missing (i.e., pte_offset_map_lock()
> returns NULL). This ensures we do not attempt to continue swapping when
> the page table structure is incomplete or changed, preventing
> unnecessary work.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Cc: Chuanhua Han <hanchuanhua@oppo.com>
> Cc: Barry Song <v-songbaohua@oppo.com>
> ---
>   mm/memory.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2a55edc48a65..1b8ef4f0ea60 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4566,7 +4566,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>   	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>   				  vmf->address & PMD_MASK, &ptl);
>   	if (unlikely(!pte))
> -		goto fallback;
> +		return ERR_PTR(-EAGAIN);
>   
>   	/*
>   	 * For do_swap_page, find the highest order where the aligned range is
> @@ -4709,6 +4709,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   		    __swap_count(entry) == 1) {
>   			/* skip swapcache */
>   			folio = alloc_swap_folio(vmf);
> +			if (IS_ERR(folio))
> +				goto out;
>   			if (folio) {
>   				__folio_set_locked(folio);
>   				__folio_set_swapbacked(folio);

How would we be able to even trigger this?

Trigger a swap fault with concurrent MADV_DONTNEED and concurrent page 
table reclaim.

Is that really something we should be worrying about?

-- 
Cheers

David



* Re: [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan
  2025-12-16  7:59 ` [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan Wei Yang
@ 2025-12-19  8:47   ` David Hildenbrand (Red Hat)
  2025-12-20  3:36     ` Wei Yang
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19  8:47 UTC (permalink / raw)
  To: Wei Yang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko
  Cc: linux-mm, hanchuanhua, v-songbaohua

On 12/16/25 08:59, Wei Yang wrote:
> The alloc_swap_folio() function performs an initial scan of the PTE
> table solely to determine the potential size (order) of the folio
> content that needs to be swapped in.
> 
> Locking the PTE table during this initial read is unnecessary for two
> reasons:
> 
>    * We are not writing to the PTE table at this stage.
> 
>    * The code will re-check and lock the table again immediately before
>      any actual modification is attempted.
> 
> This commit refactors the initial scan to map the PTE table without
> acquiring the lock. This reduces contention and overhead, improving
> performance of the swap-in path.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Cc: Chuanhua Han <hanchuanhua@oppo.com>
> Cc: Barry Song <v-songbaohua@oppo.com>
> ---
>   mm/memory.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 1b8ef4f0ea60..f8d6adfa83d7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4529,7 +4529,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>   	struct folio *folio;
>   	unsigned long addr;
>   	softleaf_t entry;
> -	spinlock_t *ptl;
>   	pte_t *pte;
>   	gfp_t gfp;
>   	int order;
> @@ -4563,8 +4562,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>   	if (!orders)
>   		goto fallback;
>   
> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> -				  vmf->address & PMD_MASK, &ptl);
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>   	if (unlikely(!pte))

Can can_swapin_thp() deal with concurrent unmap and possible freeing of 
pages+swap?

We have some code that depends on swap entries stabilizing the swap 
device etc; the moment you allow for that concurrently to go away you 
open a can of worms.

-- 
Cheers

David



* Re: [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists
  2025-12-19  8:42   ` David Hildenbrand (Red Hat)
@ 2025-12-20  3:24     ` Wei Yang
  2025-12-21  9:40       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 9+ messages in thread
From: Wei Yang @ 2025-12-20  3:24 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Wei Yang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, hanchuanhua, v-songbaohua

On Fri, Dec 19, 2025 at 09:42:16AM +0100, David Hildenbrand (Red Hat) wrote:
>On 12/16/25 08:59, Wei Yang wrote:
>> The alloc_swap_folio() function scans the PTE table to determine the
>> potential size (order) of the folio content to be swapped in.
>> 
>> Currently, if the call to pte_offset_map_lock() returns NULL, it
>> indicates that no PTE table exists for that range. Despite this, the
>> code proceeds to allocate an order-0 folio and continues the swap-in
>> process. This is unnecessary if the required table is absent.
>> 
>> This commit modifies the logic to immediately bail out of the swap-in
>> process when the PTE table is missing (i.e., pte_offset_map_lock()
>> returns NULL). This ensures we do not attempt to continue swapping when
>> the page table structure is incomplete or changed, preventing
>> unnecessary work.
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> Cc: Chuanhua Han <hanchuanhua@oppo.com>
>> Cc: Barry Song <v-songbaohua@oppo.com>
>> ---
>>   mm/memory.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2a55edc48a65..1b8ef4f0ea60 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4566,7 +4566,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>   	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>   				  vmf->address & PMD_MASK, &ptl);
>>   	if (unlikely(!pte))
>> -		goto fallback;
>> +		return ERR_PTR(-EAGAIN);
>>   	/*
>>   	 * For do_swap_page, find the highest order where the aligned range is
>> @@ -4709,6 +4709,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   		    __swap_count(entry) == 1) {
>>   			/* skip swapcache */
>>   			folio = alloc_swap_folio(vmf);
>> +			if (IS_ERR(folio))
>> +				goto out;
>>   			if (folio) {
>>   				__folio_set_locked(folio);
>>   				__folio_set_swapbacked(folio);
>
>How would we be able to even trigger this?

To be honest, I haven't thought about it. Thanks for the question.

>
>Trigger a swap fault with concurrent MADV_DONTNEED and concurrent page table
>reclaim.
>

Let me try to catch up with you.

  A swap fault is triggered because the user is accessing this range.
  MADV_DONTNEED is also triggered by the user and means this range is no longer needed.

So we don't expect the user to do these two contrary things at the same time.
This is your point, right?
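
If so, the combination would be something like the following (hypothetical,
untested userspace sketch; whether the PTE table itself actually gets freed
also depends on page table reclaim kicking in):

/* untested illustration only: fault on a swapped-out range vs. MADV_DONTNEED */
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL << 20)
static char *area;

static void *faulter(void *arg)
{
        for (;;)                                /* may hit do_swap_page() */
                (void)*(volatile char *)area;
        return NULL;
}

static void *zapper(void *arg)
{
        for (;;)                                /* zaps PTEs in the range */
                madvise(area, LEN, MADV_DONTNEED);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        area = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(area, 1, LEN);                   /* populate */
        madvise(area, LEN, MADV_PAGEOUT);       /* try to push it to swap */

        pthread_create(&t1, NULL, faulter, NULL);
        pthread_create(&t2, NULL, zapper, NULL);
        pthread_join(t1, NULL);
        return 0;
}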

>Is that really something we should be worrying about?
>

Now I question myself why alloc_anon_folio() needs to bail out like this.

Page fault vs MADV_DONTNEED is not the concern. So is page fault vs khugepaged
the case? But as far as I can see, khugepaged takes mmap_write_lock(), which
prevents page faults, before collapsing the pmd.

I am curious which case alloc_anon_folio() tries to handle here.

>-- 
>Cheers
>
>David

-- 
Wei Yang
Help you, Help me



* Re: [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan
  2025-12-19  8:47   ` David Hildenbrand (Red Hat)
@ 2025-12-20  3:36     ` Wei Yang
  2025-12-21  9:47       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 9+ messages in thread
From: Wei Yang @ 2025-12-20  3:36 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Wei Yang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, hanchuanhua, v-songbaohua

On Fri, Dec 19, 2025 at 09:47:17AM +0100, David Hildenbrand (Red Hat) wrote:
>On 12/16/25 08:59, Wei Yang wrote:
>> The alloc_swap_folio() function performs an initial scan of the PTE
>> table solely to determine the potential size (order) of the folio
>> content that needs to be swapped in.
>> 
>> Locking the PTE table during this initial read is unnecessary for two
>> reasons:
>> 
>>    * We are not writing to the PTE table at this stage.
>> 
>>    * The code will re-check and lock the table again immediately before
>>      any actual modification is attempted.
>> 
>> This commit refactors the initial scan to map the PTE table without
>> acquiring the lock. This reduces contention and overhead, improving
>> performance of the swap-in path.
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> Cc: Chuanhua Han <hanchuanhua@oppo.com>
>> Cc: Barry Song <v-songbaohua@oppo.com>
>> ---
>>   mm/memory.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1b8ef4f0ea60..f8d6adfa83d7 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4529,7 +4529,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>   	struct folio *folio;
>>   	unsigned long addr;
>>   	softleaf_t entry;
>> -	spinlock_t *ptl;
>>   	pte_t *pte;
>>   	gfp_t gfp;
>>   	int order;
>> @@ -4563,8 +4562,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>   	if (!orders)
>>   		goto fallback;
>> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> -				  vmf->address & PMD_MASK, &ptl);
>> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>   	if (unlikely(!pte))
>
>Can can_swapin_thp() deal with concurrent unmap and possible freeing of
>pages+swap?
>
>We have some code that depends on swap entries stabilizing the swap device
>etc; the moment you allow for that concurrently to go away you open a can of
>worms.
>

Sorry I don't follow you.

You mean some swap entry would be unmapped and cleared?

This is the first scan of the swap entries, and there is another
swap_entry_batch() check, with the ptl locked, before the real work, to make
sure the range is still valid for swap-in.

If someone changes the pte at this moment, it will be detected by
swap_entry_batch() after the ptl is locked.

Maybe I misunderstand you. Would you mind giving me more hints?

>-- 
>Cheers
>
>David

-- 
Wei Yang
Help you, Help me



* Re: [PATCH 1/2] mm: bail out do_swap_page() when no PTE table exists
  2025-12-20  3:24     ` Wei Yang
@ 2025-12-21  9:40       ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-21  9:40 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, linux-mm, hanchuanhua, v-songbaohua

On 12/20/25 04:24, Wei Yang wrote:
> On Fri, Dec 19, 2025 at 09:42:16AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/16/25 08:59, Wei Yang wrote:
>>> The alloc_swap_folio() function scans the PTE table to determine the
>>> potential size (order) of the folio content to be swapped in.
>>>
>>> Currently, if the call to pte_offset_map_lock() returns NULL, it
>>> indicates that no PTE table exists for that range. Despite this, the
>>> code proceeds to allocate an order-0 folio and continues the swap-in
>>> process. This is unnecessary if the required table is absent.
>>>
>>> This commit modifies the logic to immediately bail out of the swap-in
>>> process when the PTE table is missing (i.e., pte_offset_map_lock()
>>> returns NULL). This ensures we do not attempt to continue swapping when
>>> the page table structure is incomplete or changed, preventing
>>> unnecessary work.
>>>
>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>> Cc: Chuanhua Han <hanchuanhua@oppo.com>
>>> Cc: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>    mm/memory.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 2a55edc48a65..1b8ef4f0ea60 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4566,7 +4566,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>    	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>    				  vmf->address & PMD_MASK, &ptl);
>>>    	if (unlikely(!pte))
>>> -		goto fallback;
>>> +		return ERR_PTR(-EAGAIN);
>>>    	/*
>>>    	 * For do_swap_page, find the highest order where the aligned range is
>>> @@ -4709,6 +4709,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>    		    __swap_count(entry) == 1) {
>>>    			/* skip swapcache */
>>>    			folio = alloc_swap_folio(vmf);
>>> +			if (IS_ERR(folio))
>>> +				goto out;
>>>    			if (folio) {
>>>    				__folio_set_locked(folio);
>>>    				__folio_set_swapbacked(folio);
>>
>> How would we be able to even trigger this?
> 
> To be honest, I haven't thought about it. Thanks for the question.

Always ask yourself that question when trying to optimize something :)

Optimizing out unnecessary work in something that doesn't happen all that 
frequently is not particularly helpful.

> 
>>
>> Trigger a swap fault with concurrent MADV_DONTNEED and concurrent page table
>> reclaim.
>>
> 
> Let me try to catch up with you.
> 
>    A swap fault is triggered because the user is accessing this range.

And there was a page table with a swap entry.

>    MADV_DONTNEED is also triggered by the user and means this range is no longer needed.

Right, another thread could be zapping that range.

> 
> So we don't expect the user to do these two contrary things at the same time.
> This is your point, right?

They could. And it's valid. It just likely doesn't make a lot of sense :)

> 
>> Is that really something we should be worrying about?

But for the page table to vanish you'd actually need page table reclaim 
(as triggered by MADV_DONTNEED) to zap the whole page table.

>>
> 
> Now I question myself why alloc_anon_folio() needs to bail out like this.

Your patch adds more complication by making it valid for 
alloc_swap_folio() to return an error pointer. I don't like that when 
there is no evidence that we frequently trigger that.

Also, there is no real difference between not finding a page table 
(reclaimed) and finding that the pte changed (as handled by can_swapin_thp()).

In fact, the latter (PTE changed) is even much more likely than a 
reclaimed page table.

-- 
Cheers

David



* Re: [PATCH 2/2] mm: avoid unnecessary PTE table lock during initial swap folio scan
  2025-12-20  3:36     ` Wei Yang
@ 2025-12-21  9:47       ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-21  9:47 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, linux-mm, hanchuanhua, v-songbaohua

On 12/20/25 04:36, Wei Yang wrote:
> On Fri, Dec 19, 2025 at 09:47:17AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/16/25 08:59, Wei Yang wrote:
>>> The alloc_swap_folio() function performs an initial scan of the PTE
>>> table solely to determine the potential size (order) of the folio
>>> content that needs to be swapped in.
>>>
>>> Locking the PTE table during this initial read is unnecessary for two
>>> reasons:
>>>
>>>     * We are not writing to the PTE table at this stage.
>>>
>>>     * The code will re-check and lock the table again immediately before
>>>       any actual modification is attempted.
>>>
>>> This commit refactors the initial scan to map the PTE table without
>>> acquiring the lock. This reduces contention and overhead, improving
>>> performance of the swap-in path.
>>>
>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>> Cc: Chuanhua Han <hanchuanhua@oppo.com>
>>> Cc: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>    mm/memory.c | 6 ++----
>>>    1 file changed, 2 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1b8ef4f0ea60..f8d6adfa83d7 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4529,7 +4529,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>    	struct folio *folio;
>>>    	unsigned long addr;
>>>    	softleaf_t entry;
>>> -	spinlock_t *ptl;
>>>    	pte_t *pte;
>>>    	gfp_t gfp;
>>>    	int order;
>>> @@ -4563,8 +4562,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>    	if (!orders)
>>>    		goto fallback;
>>> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>> -				  vmf->address & PMD_MASK, &ptl);
>>> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>    	if (unlikely(!pte))
>>
>> Can can_swapin_thp() deal with concurrent unmap and possible freeing of
>> pages+swap?
>>
>> We have some code that depends on swap entries stabilizing the swap device
>> etc; the moment you allow for that concurrently to go away you open a can of
>> worms.
>>
> 
> Sorry I don't follow you.
> 
> You mean some swap entry would be unmapped and cleared?

We could concurrently be zapping the page table. That means, after we read a 
swap PTE, a different thread could concurrently be freeing the swap entry.

So the moment you depend on something that a swap PTE points at in the swap 
subsystem, you might be in trouble.

swap_pte_batch() does things like lookup_swap_cgroup_id(), and 
can_swapin_thp() does things like swap_zeromap_batch() and 
non_swapcache_batch().

I don't know what happens if we can have concurrent zap+freeing of swap 
entries there, and if we could trigger some undefined behavior.

Therefore, we have to be a bit more careful here.

Because I assume this is the first time that we walk swap entries 
without the PTE lock held?

-- 
Cheers

David


