linux-mm.kvack.org archive mirror
* [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
@ 2025-10-01  3:22 Lance Yang
  2025-10-01  8:31 ` David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Lance Yang @ 2025-10-01  3:22 UTC (permalink / raw)
  To: akpm
  Cc: david, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy, richard.weiyang, Lance Yang

From: Lance Yang <lance.yang@linux.dev>

Currently, special non-swap entries (like migration, hwpoison, or PTE
markers) are not caught early in hpage_collapse_scan_pmd(), leading to
failures deep in the swap-in logic.

hpage_collapse_scan_pmd()
 `- collapse_huge_page()
     `- __collapse_huge_page_swapin() -> fails!

As David suggested[1], this patch skips any such non-swap entries
early. If any one is found, the scan is aborted immediately with the
SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
work.

[1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
[2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
v1 -> v2:
 - Skip all non-present entries except swap entries (per David) thanks!
 - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/

 mm/khugepaged.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7ab2d1a42df3..d0957648db19 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
-		if (is_swap_pte(pteval)) {
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++none_or_zero;
+			if (!userfaultfd_armed(vma) &&
+			    (!cc->is_khugepaged ||
+			     none_or_zero <= khugepaged_max_ptes_none)) {
+				continue;
+			} else {
+				result = SCAN_EXCEED_NONE_PTE;
+				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				goto out_unmap;
+			}
+		} else if (!pte_present(pteval)) {
+			if (non_swap_entry(pte_to_swp_entry(pteval))) {
+				result = SCAN_PTE_NON_PRESENT;
+				goto out_unmap;
+			}
+
 			++unmapped;
 			if (!cc->is_khugepaged ||
 			    unmapped <= khugepaged_max_ptes_swap) {
@@ -1293,7 +1309,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				 * enabled swap entries.  Please see
 				 * comment below for pte_uffd_wp().
 				 */
-				if (pte_swp_uffd_wp_any(pteval)) {
+				if (pte_swp_uffd_wp(pteval)) {
 					result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
@@ -1304,18 +1320,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				goto out_unmap;
 			}
 		}
-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
-				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-				goto out_unmap;
-			}
-		}
 		if (pte_uffd_wp(pteval)) {
 			/*
 			 * Don't collapse the page if any of the small
-- 
2.49.0




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01  3:22 [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries Lance Yang
@ 2025-10-01  8:31 ` David Hildenbrand
  2025-10-01  9:38   ` Lance Yang
  2025-10-01  8:54 ` Wei Yang
  2025-10-01 10:20 ` Dev Jain
  2 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2025-10-01  8:31 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang, dev.jain,
	hughd, ioworker0, kirill, linux-kernel, linux-mm, mpenttil,
	npache, ryan.roberts, ziy, richard.weiyang

On 01.10.25 05:22, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> Currently, special non-swap entries (like migration, hwpoison, or PTE
> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
> failures deep in the swap-in logic.
> 
> hpage_collapse_scan_pmd()
>   `- collapse_huge_page()
>       `- __collapse_huge_page_swapin() -> fails!
> 
> As David suggested[1], this patch skips any such non-swap entries
> early. If any one is found, the scan is aborted immediately with the
> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
> work.
> 
> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---

Could have mentioned that you adjusted the code flow to resemble what
we have in __collapse_huge_page_isolate().

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01  3:22 [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries Lance Yang
  2025-10-01  8:31 ` David Hildenbrand
@ 2025-10-01  8:54 ` Wei Yang
  2025-10-01  9:15   ` David Hildenbrand
  2025-10-01 10:05   ` Lance Yang
  2025-10-01 10:20 ` Dev Jain
  2 siblings, 2 replies; 16+ messages in thread
From: Wei Yang @ 2025-10-01  8:54 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy, richard.weiyang

On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>From: Lance Yang <lance.yang@linux.dev>
>
>Currently, special non-swap entries (like migration, hwpoison, or PTE
>markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>failures deep in the swap-in logic.
>
>hpage_collapse_scan_pmd()
> `- collapse_huge_page()
>     `- __collapse_huge_page_swapin() -> fails!
>
>As David suggested[1], this patch skips any such non-swap entries
>early. If any one is found, the scan is aborted immediately with the
>SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>work.
>
>[1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>[2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>
>Suggested-by: David Hildenbrand <david@redhat.com>
>Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>Signed-off-by: Lance Yang <lance.yang@linux.dev>
>---
>v1 -> v2:
> - Skip all non-present entries except swap entries (per David) thanks!
> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>
> mm/khugepaged.c | 32 ++++++++++++++++++--------------
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 7ab2d1a42df3..d0957648db19 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> 	     _pte++, addr += PAGE_SIZE) {
> 		pte_t pteval = ptep_get(_pte);
>-		if (is_swap_pte(pteval)) {

It looks is_swap_pte() is mis-leading?

>+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>+			++none_or_zero;
>+			if (!userfaultfd_armed(vma) &&
>+			    (!cc->is_khugepaged ||
>+			     none_or_zero <= khugepaged_max_ptes_none)) {
>+				continue;
>+			} else {
>+				result = SCAN_EXCEED_NONE_PTE;
>+				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>+				goto out_unmap;
>+			}
>+		} else if (!pte_present(pteval)) {
>+			if (non_swap_entry(pte_to_swp_entry(pteval))) {
>+				result = SCAN_PTE_NON_PRESENT;
>+				goto out_unmap;
>+			}
>+
> 			++unmapped;
> 			if (!cc->is_khugepaged ||
> 			    unmapped <= khugepaged_max_ptes_swap) {
>@@ -1293,7 +1309,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> 				 * enabled swap entries.  Please see
> 				 * comment below for pte_uffd_wp().
> 				 */
>-				if (pte_swp_uffd_wp_any(pteval)) {
>+				if (pte_swp_uffd_wp(pteval)) {

I am not sure why we want to change this. There is no description in the
change log.

Would you mind giving some hint on this?

> 					result = SCAN_PTE_UFFD_WP;
> 					goto out_unmap;
> 				}
>@@ -1304,18 +1320,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> 				goto out_unmap;
> 			}
> 		}
>-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>-			++none_or_zero;
>-			if (!userfaultfd_armed(vma) &&
>-			    (!cc->is_khugepaged ||
>-			     none_or_zero <= khugepaged_max_ptes_none)) {
>-				continue;
>-			} else {
>-				result = SCAN_EXCEED_NONE_PTE;
>-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>-				goto out_unmap;
>-			}
>-		}
> 		if (pte_uffd_wp(pteval)) {
> 			/*
> 			 * Don't collapse the page if any of the small
>-- 
>2.49.0

-- 
Wei Yang
Help you, Help me



* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01  8:54 ` Wei Yang
@ 2025-10-01  9:15   ` David Hildenbrand
  2025-10-01 10:05   ` Lance Yang
  1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2025-10-01  9:15 UTC (permalink / raw)
  To: Wei Yang, Lance Yang
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy

On 01.10.25 10:54, Wei Yang wrote:
> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>> failures deep in the swap-in logic.
>>
>> hpage_collapse_scan_pmd()
>> `- collapse_huge_page()
>>      `- __collapse_huge_page_swapin() -> fails!
>>
>> As David suggested[1], this patch skips any such non-swap entries
>> early. If any one is found, the scan is aborted immediately with the
>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> work.
>>
>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> v1 -> v2:
>> - Skip all non-present entries except swap entries (per David) thanks!
>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>
>> mm/khugepaged.c | 32 ++++++++++++++++++--------------
>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 7ab2d1a42df3..d0957648db19 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> 	     _pte++, addr += PAGE_SIZE) {
>> 		pte_t pteval = ptep_get(_pte);
>> -		if (is_swap_pte(pteval)) {
> 
> It looks is_swap_pte() is mis-leading?

Yeah, it nowadays includes non-swap entries as well.

Maybe we should consider renaming it at some point to something better.
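
A minimal sketch of the distinction (the wrapper below is made up for
illustration; is_swap_pte(), pte_to_swp_entry() and non_swap_entry() are
the existing helpers):

/*
 * is_swap_pte() only says the PTE is neither none nor present; the entry
 * behind it may still be a migration, device-private, device-exclusive,
 * hwpoison or marker entry rather than a genuine swap entry.
 */
static inline bool is_genuine_swap_pte(pte_t pte)
{
	return is_swap_pte(pte) && !non_swap_entry(pte_to_swp_entry(pte));
}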

-- 
Cheers

David / dhildenb




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01  8:31 ` David Hildenbrand
@ 2025-10-01  9:38   ` Lance Yang
  0 siblings, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-10-01  9:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang, dev.jain,
	hughd, ioworker0, kirill, linux-kernel, linux-mm, mpenttil,
	npache, ryan.roberts, ziy, richard.weiyang, akpm



On 2025/10/1 16:31, David Hildenbrand wrote:
> On 01.10.25 05:22, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>> failures deep in the swap-in logic.
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> As David suggested[1], this patch skips any such non-swap entries
>> early. If any one is found, the scan is aborted immediately with the
>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> work.
>>
>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
> 
> Could have mentioned that you adjusted the code flow to resemble what
> we have in __collapse_huge_page_isolate().

Thanks for pointing that out! Let's do that separately. I'm working
on a follow-up patch that will unify the scanning with the
almost-duplicated code as you suggested ;)

> 
> Acked-by: David Hildenbrand <david@redhat.com>

Cheers!



* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01  8:54 ` Wei Yang
  2025-10-01  9:15   ` David Hildenbrand
@ 2025-10-01 10:05   ` Lance Yang
  2025-10-01 13:52     ` Wei Yang
  2025-10-05  1:05     ` Wei Yang
  1 sibling, 2 replies; 16+ messages in thread
From: Lance Yang @ 2025-10-01 10:05 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy



On 2025/10/1 16:54, Wei Yang wrote:
> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>> failures deep in the swap-in logic.
>>
>> hpage_collapse_scan_pmd()
>> `- collapse_huge_page()
>>      `- __collapse_huge_page_swapin() -> fails!
>>
>> As David suggested[1], this patch skips any such non-swap entries
>> early. If any one is found, the scan is aborted immediately with the
>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> work.
>>
>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> v1 -> v2:
>> - Skip all non-present entries except swap entries (per David) thanks!
>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>
>> mm/khugepaged.c | 32 ++++++++++++++++++--------------
>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 7ab2d1a42df3..d0957648db19 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> 	     _pte++, addr += PAGE_SIZE) {
>> 		pte_t pteval = ptep_get(_pte);
>> -		if (is_swap_pte(pteval)) {
> 
> It looks is_swap_pte() is mis-leading?

Hmm.. not to me, IMO. is_swap_pte() just means:

!pte_none(pte) && !pte_present(pte)
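
For reference, that is literally how the helper is defined (in
include/linux/swapops.h):

static inline int is_swap_pte(pte_t pte)
{
	return !pte_none(pte) && !pte_present(pte);
}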

> 
>> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> +			++none_or_zero;
>> +			if (!userfaultfd_armed(vma) &&
>> +			    (!cc->is_khugepaged ||
>> +			     none_or_zero <= khugepaged_max_ptes_none)) {
>> +				continue;
>> +			} else {
>> +				result = SCAN_EXCEED_NONE_PTE;
>> +				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> +				goto out_unmap;
>> +			}
>> +		} else if (!pte_present(pteval)) {
>> +			if (non_swap_entry(pte_to_swp_entry(pteval))) {
>> +				result = SCAN_PTE_NON_PRESENT;
>> +				goto out_unmap;
>> +			}
>> +
>> 			++unmapped;
>> 			if (!cc->is_khugepaged ||
>> 			    unmapped <= khugepaged_max_ptes_swap) {
>> @@ -1293,7 +1309,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> 				 * enabled swap entries.  Please see
>> 				 * comment below for pte_uffd_wp().
>> 				 */
>> -				if (pte_swp_uffd_wp_any(pteval)) {
>> +				if (pte_swp_uffd_wp(pteval)) {
> 
> I am not sure why we want to change this. There is no description in the
> change log.
> 
> Would you mind giving some hint on this?

The reason is that pte_swp_uffd_wp_any(pte) is broader than what
we need :)

static inline bool pte_swp_uffd_wp_any(pte_t pte)
{
#ifdef CONFIG_PTE_MARKER_UFFD_WP
	if (!is_swap_pte(pte))
		return false;

	if (pte_swp_uffd_wp(pte))
		return true;

	if (pte_marker_uffd_wp(pte))
		return true;
#endif
	return false;
}

In the context within hpage_collapse_scan_pmd(), we are already inside
an is_swap_pte() block, and we have just handled all non-swap entries
(which would include pte_marker_uffd_wp()).

So we only need to check if the swap entry itself is write-protected
for userfaultfd ;)

Hope that explains it. I skipped it in the changelog as it's a tiny
cleanup ...

Thanks,
Lance

> 
>> 					result = SCAN_PTE_UFFD_WP;
>> 					goto out_unmap;
>> 				}
>> @@ -1304,18 +1320,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> 				goto out_unmap;
>> 			}
>> 		}
>> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> -			++none_or_zero;
>> -			if (!userfaultfd_armed(vma) &&
>> -			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> -				continue;
>> -			} else {
>> -				result = SCAN_EXCEED_NONE_PTE;
>> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> -				goto out_unmap;
>> -			}
>> -		}
>> 		if (pte_uffd_wp(pteval)) {
>> 			/*
>> 			 * Don't collapse the page if any of the small
>> -- 
>> 2.49.0
> 




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01  3:22 [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries Lance Yang
  2025-10-01  8:31 ` David Hildenbrand
  2025-10-01  8:54 ` Wei Yang
@ 2025-10-01 10:20 ` Dev Jain
  2025-10-01 10:48   ` Lance Yang
  2 siblings, 1 reply; 16+ messages in thread
From: Dev Jain @ 2025-10-01 10:20 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: david, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang, hughd,
	ioworker0, kirill, linux-kernel, linux-mm, mpenttil, npache,
	ryan.roberts, ziy, richard.weiyang


On 01/10/25 8:52 am, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Currently, special non-swap entries (like migration, hwpoison, or PTE
> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
> failures deep in the swap-in logic.
>
> hpage_collapse_scan_pmd()
>   `- collapse_huge_page()
>       `- __collapse_huge_page_swapin() -> fails!
>
> As David suggested[1], this patch skips any such non-swap entries
> early. If any one is found, the scan is aborted immediately with the
> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
> work.
>
> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> v1 -> v2:
>   - Skip all non-present entries except swap entries (per David) thanks!
>   - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>
>   mm/khugepaged.c | 32 ++++++++++++++++++--------------
>   1 file changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7ab2d1a42df3..d0957648db19 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>   	     _pte++, addr += PAGE_SIZE) {
>   		pte_t pteval = ptep_get(_pte);
> -		if (is_swap_pte(pteval)) {
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			++none_or_zero;
> +			if (!userfaultfd_armed(vma) &&
> +			    (!cc->is_khugepaged ||
> +			     none_or_zero <= khugepaged_max_ptes_none)) {
> +				continue;
> +			} else {
> +				result = SCAN_EXCEED_NONE_PTE;
> +				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				goto out_unmap;
> +			}
> +		} else if (!pte_present(pteval)) {

If you are trying to merge this with the _isolate() conditions, we can do
a micro-optimization here - is_swap_pte, (pte_none && is_zero_pfn), and pte_uffd_wp
are disjoint conditions, so we can use if-else-if-else-if to write them.

> +			if (non_swap_entry(pte_to_swp_entry(pteval))) {
> +				result = SCAN_PTE_NON_PRESENT;
> +				goto out_unmap;
> +			}
> +
>   			++unmapped;
>   			if (!cc->is_khugepaged ||
>   			    unmapped <= khugepaged_max_ptes_swap) {
> @@ -1293,7 +1309,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   				 * enabled swap entries.  Please see
>   				 * comment below for pte_uffd_wp().
>   				 */
> -				if (pte_swp_uffd_wp_any(pteval)) {
> +				if (pte_swp_uffd_wp(pteval)) {
>   					result = SCAN_PTE_UFFD_WP;

Could have mentioned in the changelog "while at it, convert pte_swp_uffd_wp_any to
pte_swp_uffd_wp since we are in the swap pte branch".

>   					goto out_unmap;
>   				}
> @@ -1304,18 +1320,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   				goto out_unmap;
>   			}
>   		}
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out_unmap;
> -			}
> -		}
>   		if (pte_uffd_wp(pteval)) {
>   			/*
>   			 * Don't collapse the page if any of the small

Otherwise LGTM

Reviewed-by: Dev Jain <dev.jain@arm.com>




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01 10:20 ` Dev Jain
@ 2025-10-01 10:48   ` Lance Yang
  0 siblings, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-10-01 10:48 UTC (permalink / raw)
  To: Dev Jain
  Cc: david, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang, hughd,
	ioworker0, kirill, linux-kernel, linux-mm, mpenttil, npache,
	ryan.roberts, ziy, richard.weiyang, akpm



On 2025/10/1 18:20, Dev Jain wrote:
> 
> On 01/10/25 8:52 am, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>> failures deep in the swap-in logic.
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> As David suggested[1], this patch skips any such non-swap entries
>> early. If any one is found, the scan is aborted immediately with the
>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> work.
>>
>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> v1 -> v2:
>>   - Skip all non-present entries except swap entries (per David) thanks!
>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>
>>   mm/khugepaged.c | 32 ++++++++++++++++++--------------
>>   1 file changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 7ab2d1a42df3..d0957648db19 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct 
>> mm_struct *mm,
>>       for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>            _pte++, addr += PAGE_SIZE) {
>>           pte_t pteval = ptep_get(_pte);
>> -        if (is_swap_pte(pteval)) {
>> +        if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> +            ++none_or_zero;
>> +            if (!userfaultfd_armed(vma) &&
>> +                (!cc->is_khugepaged ||
>> +                 none_or_zero <= khugepaged_max_ptes_none)) {
>> +                continue;
>> +            } else {
>> +                result = SCAN_EXCEED_NONE_PTE;
>> +                count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> +                goto out_unmap;
>> +            }
>> +        } else if (!pte_present(pteval)) {
> 
> If you are trying to merge this with the _isolate() conditions, we can do
> a micro-optimization here - is_swap_pte, (pte_none && is_zero_pfn), and 
> pte_uffd_wp
> are disjoint conditions, so we can use if-else-if-else-if to write them.

Ah, indeed, thanks!

I think it would fit better into the follow-up patch that unifies the
scanning logic, and I'll make sure to include it there ;p

> 
>> +            if (non_swap_entry(pte_to_swp_entry(pteval))) {
>> +                result = SCAN_PTE_NON_PRESENT;
>> +                goto out_unmap;
>> +            }
>> +
>>               ++unmapped;
>>               if (!cc->is_khugepaged ||
>>                   unmapped <= khugepaged_max_ptes_swap) {
>> @@ -1293,7 +1309,7 @@ static int hpage_collapse_scan_pmd(struct 
>> mm_struct *mm,
>>                    * enabled swap entries.  Please see
>>                    * comment below for pte_uffd_wp().
>>                    */
>> -                if (pte_swp_uffd_wp_any(pteval)) {
>> +                if (pte_swp_uffd_wp(pteval)) {
>>                       result = SCAN_PTE_UFFD_WP;
> 
> Could have mentioned in the changelog "while at it, convert 
> pte_swp_uffd_wp_any to
> pte_swp_uffd_wp since we are in the swap pte branch".

Right, that would have been clearer.

I'll add that if a next version is needed :)

> 
>>                       goto out_unmap;
>>                   }
>> @@ -1304,18 +1320,6 @@ static int hpage_collapse_scan_pmd(struct 
>> mm_struct *mm,
>>                   goto out_unmap;
>>               }
>>           }
>> -        if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> -            ++none_or_zero;
>> -            if (!userfaultfd_armed(vma) &&
>> -                (!cc->is_khugepaged ||
>> -                 none_or_zero <= khugepaged_max_ptes_none)) {
>> -                continue;
>> -            } else {
>> -                result = SCAN_EXCEED_NONE_PTE;
>> -                count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> -                goto out_unmap;
>> -            }
>> -        }
>>           if (pte_uffd_wp(pteval)) {
>>               /*
>>                * Don't collapse the page if any of the small
> 
> Otherwise LGTM
> 
> Reviewed-by: Dev Jain <dev.jain@arm.com>

Cheers!




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01 10:05   ` Lance Yang
@ 2025-10-01 13:52     ` Wei Yang
  2025-10-05  1:05     ` Wei Yang
  1 sibling, 0 replies; 16+ messages in thread
From: Wei Yang @ 2025-10-01 13:52 UTC (permalink / raw)
  To: Lance Yang
  Cc: Wei Yang, akpm, david, lorenzo.stoakes, Liam.Howlett, baohua,
	baolin.wang, dev.jain, hughd, ioworker0, kirill, linux-kernel,
	linux-mm, mpenttil, npache, ryan.roberts, ziy

On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:
>
>
>On 2025/10/1 16:54, Wei Yang wrote:
>> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>> > From: Lance Yang <lance.yang@linux.dev>
>> > 
>> > Currently, special non-swap entries (like migration, hwpoison, or PTE
>> > markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>> > failures deep in the swap-in logic.
>> > 
>> > hpage_collapse_scan_pmd()
>> > `- collapse_huge_page()
>> >      `- __collapse_huge_page_swapin() -> fails!
>> > 
>> > As David suggested[1], this patch skips any such non-swap entries
>> > early. If any one is found, the scan is aborted immediately with the
>> > SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> > work.
>> > 
>> > [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>> > [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>> > 
>> > Suggested-by: David Hildenbrand <david@redhat.com>
>> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> > ---
>> > v1 -> v2:
>> > - Skip all non-present entries except swap entries (per David) thanks!
>> > - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>> > 
>> > mm/khugepaged.c | 32 ++++++++++++++++++--------------
>> > 1 file changed, 18 insertions(+), 14 deletions(-)
>> > 
>> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> > index 7ab2d1a42df3..d0957648db19 100644
>> > --- a/mm/khugepaged.c
>> > +++ b/mm/khugepaged.c
>> > @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> > 	     _pte++, addr += PAGE_SIZE) {
>> > 		pte_t pteval = ptep_get(_pte);
>> > -		if (is_swap_pte(pteval)) {
>> 
>> It looks is_swap_pte() is mis-leading?
>
>Hmm.. not to me, IMO. is_swap_pte() just means:
>
>!pte_none(pte) && !pte_present(pte)
>
>> 
>> > +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> > +			++none_or_zero;
>> > +			if (!userfaultfd_armed(vma) &&
>> > +			    (!cc->is_khugepaged ||
>> > +			     none_or_zero <= khugepaged_max_ptes_none)) {
>> > +				continue;
>> > +			} else {
>> > +				result = SCAN_EXCEED_NONE_PTE;
>> > +				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> > +				goto out_unmap;
>> > +			}
>> > +		} else if (!pte_present(pteval)) {
>> > +			if (non_swap_entry(pte_to_swp_entry(pteval))) {
>> > +				result = SCAN_PTE_NON_PRESENT;
>> > +				goto out_unmap;
>> > +			}
>> > +
>> > 			++unmapped;
>> > 			if (!cc->is_khugepaged ||
>> > 			    unmapped <= khugepaged_max_ptes_swap) {
>> > @@ -1293,7 +1309,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > 				 * enabled swap entries.  Please see
>> > 				 * comment below for pte_uffd_wp().
>> > 				 */
>> > -				if (pte_swp_uffd_wp_any(pteval)) {
>> > +				if (pte_swp_uffd_wp(pteval)) {
>> 
>> I am not sure why we want to change this. There is no description in the
>> change log.
>> 
>> Would you mind giving some hint on this?
>
>The reason is that pte_swp_uffd_wp_any(pte) is broader than what
>we need :)
>
>static inline bool pte_swp_uffd_wp_any(pte_t pte)
>{
>#ifdef CONFIG_PTE_MARKER_UFFD_WP
>	if (!is_swap_pte(pte))
>		return false;
>
>	if (pte_swp_uffd_wp(pte))
>		return true;
>
>	if (pte_marker_uffd_wp(pte))
>		return true;
>#endif
>	return false;
>}
>
>In the context within hpage_collapse_scan_pmd(), we are already inside
>an is_swap_pte() block, and we have just handled all non-swap entries
>(which would include pte_marker_uffd_wp()).
>
>So we only need to check if the swap entry itself is write-protected
>for userfaultfd ;)
>
>Hope that explains it. I skipped it in the changelog as it's a tiny
>cleanup ...

Thanks, I got it.

Generally, looks good to me.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me



* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-01 10:05   ` Lance Yang
  2025-10-01 13:52     ` Wei Yang
@ 2025-10-05  1:05     ` Wei Yang
  2025-10-05  2:12       ` Lance Yang
  1 sibling, 1 reply; 16+ messages in thread
From: Wei Yang @ 2025-10-05  1:05 UTC (permalink / raw)
  To: Lance Yang
  Cc: Wei Yang, akpm, david, lorenzo.stoakes, Liam.Howlett, baohua,
	baolin.wang, dev.jain, hughd, ioworker0, kirill, linux-kernel,
	linux-mm, mpenttil, npache, ryan.roberts, ziy

On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:
>
>
>On 2025/10/1 16:54, Wei Yang wrote:
>> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>> > From: Lance Yang <lance.yang@linux.dev>
>> > 
>> > Currently, special non-swap entries (like migration, hwpoison, or PTE
>> > markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>> > failures deep in the swap-in logic.
>> > 
>> > hpage_collapse_scan_pmd()
>> > `- collapse_huge_page()
>> >      `- __collapse_huge_page_swapin() -> fails!
>> > 
>> > As David suggested[1], this patch skips any such non-swap entries
>> > early. If any one is found, the scan is aborted immediately with the
>> > SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> > work.
>> > 
>> > [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>> > [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>> > 
>> > Suggested-by: David Hildenbrand <david@redhat.com>
>> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> > ---
>> > v1 -> v2:
>> > - Skip all non-present entries except swap entries (per David) thanks!
>> > - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>> > 
>> > mm/khugepaged.c | 32 ++++++++++++++++++--------------
>> > 1 file changed, 18 insertions(+), 14 deletions(-)
>> > 
>> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> > index 7ab2d1a42df3..d0957648db19 100644
>> > --- a/mm/khugepaged.c
>> > +++ b/mm/khugepaged.c
>> > @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> > 	     _pte++, addr += PAGE_SIZE) {
>> > 		pte_t pteval = ptep_get(_pte);
>> > -		if (is_swap_pte(pteval)) {
>> 
>> It looks is_swap_pte() is mis-leading?
>
>Hmm.. not to me, IMO. is_swap_pte() just means:
>
>!pte_none(pte) && !pte_present(pte)
>

Maybe it has some reason.

I took another look into __collapse_huge_page_swapin(), which just check
is_swap_pte() before do_swap_page().

We have filtered non-swap entries in hpage_collapse_scan_pmd(), but we drop
mmap lock before isolation. This looks we may have a chance to get non-swap
entry.

Do you think it is reasonable to add a non_swap_entry() check before
do_swap_page()?

-- 
Wei Yang
Help you, Help me



* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-05  1:05     ` Wei Yang
@ 2025-10-05  2:12       ` Lance Yang
  2025-10-06 14:18         ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-10-05  2:12 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy



On 2025/10/5 09:05, Wei Yang wrote:
> On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:
>>
>>
>> On 2025/10/1 16:54, Wei Yang wrote:
>>> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>>>> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>>>> failures deep in the swap-in logic.
>>>>
>>>> hpage_collapse_scan_pmd()
>>>> `- collapse_huge_page()
>>>>       `- __collapse_huge_page_swapin() -> fails!
>>>>
>>>> As David suggested[1], this patch skips any such non-swap entries
>>>> early. If any one is found, the scan is aborted immediately with the
>>>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>>>> work.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>>>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>>>
>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>> v1 -> v2:
>>>> - Skip all non-present entries except swap entries (per David) thanks!
>>>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>>>
>>>> mm/khugepaged.c | 32 ++++++++++++++++++--------------
>>>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 7ab2d1a42df3..d0957648db19 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>> 	     _pte++, addr += PAGE_SIZE) {
>>>> 		pte_t pteval = ptep_get(_pte);
>>>> -		if (is_swap_pte(pteval)) {
>>>
>>> It looks is_swap_pte() is mis-leading?
>>
>> Hmm.. not to me, IMO. is_swap_pte() just means:
>>
>> !pte_none(pte) && !pte_present(pte)
>>
> 
> Maybe it has some reason.
> 
> I took another look into __collapse_huge_page_swapin(), which just check
> is_swap_pte() before do_swap_page().
> 
> We have filtered non-swap entries in hpage_collapse_scan_pmd(), but we drop
> mmap lock before isolation. This looks we may have a chance to get non-swap
> entry.

Thanks for pointing that out!

Yep, there is a theoretical window between dropping the mmap lock
after the initial scan and re-acquiring it for isolation.

> 
> Do you think it is reasonable to add a non_swap_entry() check before
> do_swap_page()?

However, that seems unlikely in practice. IMHO, the early check in
hpage_collapse_scan_pmd() is sufficient for now, so I'd prefer to
keep it as-is :)





* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-05  2:12       ` Lance Yang
@ 2025-10-06 14:18         ` David Hildenbrand
  2025-10-06 15:02           ` Lance Yang
  2025-10-07 10:25           ` Lance Yang
  0 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand @ 2025-10-06 14:18 UTC (permalink / raw)
  To: Lance Yang, Wei Yang
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy

On 05.10.25 04:12, Lance Yang wrote:
> 
> 
> On 2025/10/5 09:05, Wei Yang wrote:
>> On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:
>>>
>>>
>>> On 2025/10/1 16:54, Wei Yang wrote:
>>>> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>
>>>>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>>>>> markers) are not caught early in hpage_collapse_scan_pmd(), leading to
>>>>> failures deep in the swap-in logic.
>>>>>
>>>>> hpage_collapse_scan_pmd()
>>>>> `- collapse_huge_page()
>>>>>        `- __collapse_huge_page_swapin() -> fails!
>>>>>
>>>>> As David suggested[1], this patch skips any such non-swap entries
>>>>> early. If any one is found, the scan is aborted immediately with the
>>>>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>>>>> work.
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>>>>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>>>>
>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>>> ---
>>>>> v1 -> v2:
>>>>> - Skip all non-present entries except swap entries (per David) thanks!
>>>>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>>>>
>>>>> mm/khugepaged.c | 32 ++++++++++++++++++--------------
>>>>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 7ab2d1a42df3..d0957648db19 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> 	     _pte++, addr += PAGE_SIZE) {
>>>>> 		pte_t pteval = ptep_get(_pte);
>>>>> -		if (is_swap_pte(pteval)) {
>>>>
>>>> It looks is_swap_pte() is mis-leading?
>>>
>>> Hmm.. not to me, IMO. is_swap_pte() just means:
>>>
>>> !pte_none(pte) && !pte_present(pte)
>>>
>>
>> Maybe it has some reason.
>>
>> I took another look into __collapse_huge_page_swapin(), which just check
>> is_swap_pte() before do_swap_page().

Thanks for pointing that out.

A function that is called __collapse_huge_page_swapin() and documented 
to "Bring missing pages in from swap" will handle other types as well.

Unbelievably horrible.

So let's think this through so we can document it in the changelog properly.

We could have currently ended up in do_swap_page() with

(1) Migration entries. We would have waited.

-> Maybe worth it to wait, maybe not. I suspect we don't stumble into
    that frequently such that we don't care. We could always unlock this
    separately later.


(2) Device-exclusive entries. We would have converted to non-exclusive.

-> See make_device_exclusive(), we cannot tolerate PMD entries and have
    to split them through FOLL_SPLIT_PMD. As popped up during a recent
    discussion, collapsing here is actually counter-productive, because
    the next conversion will PTE-map it again. (until recently, it would
    not have worked with large folios at all IIRC).

-> Ok to not collapse.

(3) Device-private entries. We would have migrated to RAM.

-> Device-private still does not support THPs, so collapsing right now 
just means that the next device access would split the folio again.

-> Ok to not collapse.

(4) HWPoison entries

-> Cannot collapse

(5) Markers

-> Cannot collapse


I suggest we add that in some form to the patch description, stating 
that we can unlock later what we really need, and not account it towards 
max_swap_ptes.

>>
>> We have filtered non-swap entries in hpage_collapse_scan_pmd(), but we drop
>> mmap lock before isolation. This looks we may have a chance to get non-swap
>> entry.
> 
> Thanks for pointing that out!
> 
> Yep, there is a theoretical window between dropping the mmap lock
> after the initial scan and re-acquiring it for isolation.
> 
>>
>> Do you think it is reasonable to add a non_swap_entry() check before
>> do_swap_page()?
> 
> However, that seems unlikely in practice. IMHO, the early check in
> hpage_collapse_scan_pmd() is sufficient for now, so I'd prefer to
> keep it as-is :)

I think we really should add that check, as per reasoning above.

I was looking into some possible races with uffd-wp being set before we 
enter do_swap_page(), but I think it might be okay (although very 
confusing).

-- 
Cheers

David / dhildenb




* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-06 14:18         ` David Hildenbrand
@ 2025-10-06 15:02           ` Lance Yang
  2025-10-07 10:25           ` Lance Yang
  1 sibling, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-10-06 15:02 UTC (permalink / raw)
  To: David Hildenbrand, Wei Yang
  Cc: lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang, dev.jain,
	hughd, ioworker0, kirill, linux-kernel, linux-mm, mpenttil,
	npache, ryan.roberts, ziy, akpm



On 2025/10/6 22:18, David Hildenbrand wrote:
> On 05.10.25 04:12, Lance Yang wrote:
>>
>>
>> On 2025/10/5 09:05, Wei Yang wrote:
>>> On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:
>>>>
>>>>
>>>> On 2025/10/1 16:54, Wei Yang wrote:
>>>>> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>>
>>>>>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>>>>>> markers) are not caught early in hpage_collapse_scan_pmd(), 
>>>>>> leading to
>>>>>> failures deep in the swap-in logic.
>>>>>>
>>>>>> hpage_collapse_scan_pmd()
>>>>>> `- collapse_huge_page()
>>>>>>        `- __collapse_huge_page_swapin() -> fails!
>>>>>>
>>>>>> As David suggested[1], this patch skips any such non-swap entries
>>>>>> early. If any one is found, the scan is aborted immediately with the
>>>>>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>>>>>> work.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>>>>>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>>>>>
>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>>>> ---
>>>>>> v1 -> v2:
>>>>>> - Skip all non-present entries except swap entries (per David) 
>>>>>> thanks!
>>>>>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>>>>>
>>>>>> mm/khugepaged.c | 32 ++++++++++++++++++--------------
>>>>>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>> index 7ab2d1a42df3..d0957648db19 100644
>>>>>> --- a/mm/khugepaged.c
>>>>>> +++ b/mm/khugepaged.c
>>>>>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct 
>>>>>> mm_struct *mm,
>>>>>>     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>>>          _pte++, addr += PAGE_SIZE) {
>>>>>>         pte_t pteval = ptep_get(_pte);
>>>>>> -        if (is_swap_pte(pteval)) {
>>>>>
>>>>> It looks is_swap_pte() is mis-leading?
>>>>
>>>> Hmm.. not to me, IMO. is_swap_pte() just means:
>>>>
>>>> !pte_none(pte) && !pte_present(pte)
>>>>
>>>
>>> Maybe it has some reason.
>>>
>>> I took another look into __collapse_huge_page_swapin(), which just check
>>> is_swap_pte() before do_swap_page().
> 
> Thanks for pointing that out.
> 
> A function that is called __collapse_huge_page_swapin() and documented 
> to "Bring missing pages in from swap" will handle other types as well.
> 
> Unbelievably horrible.
> 
> So let's think this through so we can document it in the changelog 
> properly.
> 
> We could have currently ended up in do_swap_page() with
> 
> (1) Migration entries. We would have waited.
> 
> -> Maybe worth it to wait, maybe not. I suspect we don't stumble into
>     that frequently such that we don't care. We could always unlock this
>     separately later.
> 
> 
> (2) Device-exclusive entries. We would have converted to non-exclusive.
> 
> -> See make_device_exclusive(), we cannot tolerate PMD entries and have
>     to split them through FOLL_SPLIT_PMD. As popped up during a recent
>     discussion, collapsing here is actually counter-productive, because
>     the next conversion will PTE-map it again. (until recently, it would
>     not have worked with large folios at all IIRC).
> 
> -> Ok to not collapse.
> 
> (3) Device-private entries. We would have migrated to RAM.
> 
> -> Device-private still does not support THPs, so collapsing right now 
> just means that the next device access would split the folio again.
> 
> -> Ok to not collapse.
> 
> (4) HWPoison entries
> 
> -> Cannot collapse
> 
> (5) Markers
> 
> -> Cannot collapse
> 
> 
> I suggest we add that in some form to the patch description, stating 
> that we can unlock later what we really need, and not account it towards 
> max_swap_ptes.

Cool!

I'll take a closer look and adjust the patch description accordingly ;)

Thanks a lot for the lesson!

> 
>>>
>>> We have filtered non-swap entries in hpage_collapse_scan_pmd(), but 
>>> we drop
>>> mmap lock before isolation. This looks we may have a chance to get 
>>> non-swap
>>> entry.
>>
>> Thanks for pointing that out!
>>
>> Yep, there is a theoretical window between dropping the mmap lock
>> after the initial scan and re-acquiring it for isolation.
>>
>>>
>>> Do you think it is reasonable to add a non_swap_entry() check before
>>> do_swap_page()?
>>
>> However, that seems unlikely in practice. IMHO, the early check in
>> hpage_collapse_scan_pmd() is sufficient for now, so I'd prefer to
>> keep it as-is :)
> 
> I think we really should add that check, as per reasoning above.
> 
> I was looking into some possible races with uffd-wp being set before we 
> enter do_swap_page(), but I think it might be okay (although very 
> confusing).

Ah, I see ;p

@Wei could you send a patch to add the non_swap_entry() check there?



* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-06 14:18         ` David Hildenbrand
  2025-10-06 15:02           ` Lance Yang
@ 2025-10-07 10:25           ` Lance Yang
  2025-10-08  1:37             ` Wei Yang
  2025-10-08  9:00             ` David Hildenbrand
  1 sibling, 2 replies; 16+ messages in thread
From: Lance Yang @ 2025-10-07 10:25 UTC (permalink / raw)
  To: David Hildenbrand, Wei Yang
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy



On 2025/10/6 22:18, David Hildenbrand wrote:
> On 05.10.25 04:12, Lance Yang wrote:
>>
>>
>> On 2025/10/5 09:05, Wei Yang wrote:
>>> On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:
>>>>
>>>>
>>>> On 2025/10/1 16:54, Wei Yang wrote:
>>>>> On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
>>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>>
>>>>>> Currently, special non-swap entries (like migration, hwpoison, or PTE
>>>>>> markers) are not caught early in hpage_collapse_scan_pmd(), 
>>>>>> leading to
>>>>>> failures deep in the swap-in logic.
>>>>>>
>>>>>> hpage_collapse_scan_pmd()
>>>>>> `- collapse_huge_page()
>>>>>>        `- __collapse_huge_page_swapin() -> fails!
>>>>>>
>>>>>> As David suggested[1], this patch skips any such non-swap entries
>>>>>> early. If any one is found, the scan is aborted immediately with the
>>>>>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>>>>>> work.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@redhat.com
>>>>>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>>>>>>
>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>>>> ---
>>>>>> v1 -> v2:
>>>>>> - Skip all non-present entries except swap entries (per David) 
>>>>>> thanks!
>>>>>> - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>>>>>
>>>>>> mm/khugepaged.c | 32 ++++++++++++++++++--------------
>>>>>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>> index 7ab2d1a42df3..d0957648db19 100644
>>>>>> --- a/mm/khugepaged.c
>>>>>> +++ b/mm/khugepaged.c
>>>>>> @@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct 
>>>>>> mm_struct *mm,
>>>>>>     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>>>          _pte++, addr += PAGE_SIZE) {
>>>>>>         pte_t pteval = ptep_get(_pte);
>>>>>> -        if (is_swap_pte(pteval)) {
>>>>>
>>>>> It looks is_swap_pte() is mis-leading?
>>>>
>>>> Hmm.. not to me, IMO. is_swap_pte() just means:
>>>>
>>>> !pte_none(pte) && !pte_present(pte)
>>>>
>>>
>>> Maybe it has some reason.
>>>
>>> I took another look into __collapse_huge_page_swapin(), which just check
>>> is_swap_pte() before do_swap_page().
> 
> Thanks for pointing that out.
> 
> A function that is called __collapse_huge_page_swapin() and documented 
> to "Bring missing pages in from swap" will handle other types as well.
> 
> Unbelievably horrible.
> 
> So let's think this through so we can document it in the changelog 
> properly.
> 
> We could have currently ended up in do_swap_page() with
> 
> (1) Migration entries. We would have waited.
> 
> -> Maybe worth it to wait, maybe not. I suspect we don't stumble into
>     that frequently such that we don't care. We could always unlock this
>     separately later.
> 
> 
> (2) Device-exclusive entries. We would have converted to non-exclusive.
> 
> -> See make_device_exclusive(), we cannot tolerate PMD entries and have
>     to split them through FOLL_SPLIT_PMD. As popped up during a recent
>     discussion, collapsing here is actually counter-productive, because
>     the next conversion will PTE-map it again. (until recently, it would
>     not have worked with large folios at all IIRC).
> 
> -> Ok to not collapse.
> 
> (3) Device-private entries. We would have migrated to RAM.
> 
> -> Device-private still does not support THPs, so collapsing right now 
> just means that the next device access would split the folio again.
> 
> -> Ok to not collapse.
> 
> (4) HWPoison entries
> 
> -> Cannot collapse
> 
> (5) Markers
> 
> -> Cannot collapse
> 
> 
> I suggest we add that in some form to the patch description, stating 
> that we can unlock later what we really need, and not account it towards 
> max_swap_ptes.
> 
>>>
>>> We have filtered non-swap entries in hpage_collapse_scan_pmd(), but 
>>> we drop
>>> mmap lock before isolation. This looks we may have a chance to get 
>>> non-swap
>>> entry.
>>
>> Thanks for pointing that out!
>>
>> Yep, there is a theoretical window between dropping the mmap lock
>> after the initial scan and re-acquiring it for isolation.
>>
>>>
>>> Do you think it is reasonable to add a non_swap_entry() check before
>>> do_swap_page()?
>>
>> However, that seems unlikely in practice. IMHO, the early check in
>> hpage_collapse_scan_pmd() is sufficient for now, so I'd prefer to
>> keep it as-is :)
> 
> I think we really should add that check, as per reasoning above.
> 
> I was looking into some possible races with uffd-wp being set before we 
> enter do_swap_page(), but I think it might be okay (although very 
> confusing).

How about the version below?

```
Currently, special non-swap entries (like PTE markers) are not caught
early in hpage_collapse_scan_pmd(), leading to failures deep in the
swap-in logic.

A function that is called __collapse_huge_page_swapin() and documented
to "Bring missing pages in from swap" will handle other types as well.

As analyzed by David[1], we could have ended up with the following
entry types right before do_swap_page():

   (1) Migration entries. We would have waited.
       -> Maybe worth it to wait, maybe not. We suspect we don't stumble
          into that frequently such that we don't care. We could always
          unlock this separately later.

   (2) Device-exclusive entries. We would have converted to non-exclusive.
       -> See make_device_exclusive(), we cannot tolerate PMD entries and
          have to split them through FOLL_SPLIT_PMD. As popped up during
          a recent discussion, collapsing here is actually
          counter-productive, because the next conversion will PTE-map
          it again.
       -> Ok to not collapse.

   (3) Device-private entries. We would have migrated to RAM.
       -> Device-private still does not support THPs, so collapsing right
          now just means that the next device access would split the
          folio again.
       -> Ok to not collapse.

   (4) HWPoison entries
       -> Cannot collapse

   (5) Markers
       -> Cannot collapse

First, this patch adds an early check for these non-swap entries. If
any one is found, the scan is aborted immediately with the
SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
work.

Second, as Wei pointed out[3], a non-swap entry may still show up,
since the mmap lock is dropped and re-acquired before
__collapse_huge_page_swapin(). To handle this, we also add a
non_swap_entry() check there.

Note that we can later unlock the cases we really need (e.g. waiting for
migration entries), and such entries are not accounted towards
max_ptes_swap.

[1] https://lore.kernel.org/linux-mm/09eaca7b-9988-41c7-8d6e-4802055b3f1e@redhat.com
[2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
[3] https://lore.kernel.org/linux-mm/20251005010511.ysek2nqojebqngf3@master
```
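
For clarity, here is a minimal sketch (illustration only, reusing the
helpers from <linux/swapops.h>; whether some of them do anything useful
depends on kernel config) of how a non-present, non-none PTE maps onto
the entry types above, and why a single non_swap_entry() check covers
all of them:

```
/*
 * Minimal sketch, not meant to be merged: classify a non-present,
 * non-none PTE into the entry types discussed above. All helpers are
 * from <linux/swapops.h>; some are only meaningful with the matching
 * config options (e.g. CONFIG_MIGRATION, CONFIG_DEVICE_PRIVATE).
 */
static const char *classify_nonpresent_pte(pte_t pteval)
{
	swp_entry_t entry = pte_to_swp_entry(pteval);

	if (!non_swap_entry(entry))
		return "real swap entry -> swap in, may still collapse";
	if (is_migration_entry(entry))
		return "(1) migration entry";
	if (is_device_exclusive_entry(entry))
		return "(2) device-exclusive entry";
	if (is_device_private_entry(entry))
		return "(3) device-private entry";
	if (is_hwpoison_entry(entry))
		return "(4) hwpoison entry";
	if (is_pte_marker_entry(entry))
		return "(5) PTE marker";
	return "other non-swap entry";
}
```

Everything past the first branch is a non_swap_entry(), so one early
check is enough to cover cases (1)-(5).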

I also think it makes sense to fold the change that adds the
non_swap_entry() check in __collapse_huge_page_swapin() into
this patch, rather than creating a new patch just for that :)

Hmmm... one thing I'm not sure about: regarding the uffd-wp
race you mentioned, is the pte_swp_uffd_wp() check needed
after non_swap_entry()? It seems like it might not be ...

```
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f4f57ba69d72..bec3e268dc76 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1020,6 +1020,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
  		if (!is_swap_pte(vmf.orig_pte))
  			continue;

+		if (non_swap_entry(pte_to_swp_entry(vmf.orig_pte))) {
+			result = SCAN_PTE_NON_PRESENT;
+			goto out;
+		}
+
  		vmf.pte = pte;
  		vmf.ptl = ptl;
  		ret = do_swap_page(&vmf);
```

@David does that sound good to you?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-07 10:25           ` Lance Yang
@ 2025-10-08  1:37             ` Wei Yang
  2025-10-08  9:00             ` David Hildenbrand
  1 sibling, 0 replies; 16+ messages in thread
From: Wei Yang @ 2025-10-08  1:37 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand, Wei Yang, akpm, lorenzo.stoakes, Liam.Howlett,
	baohua, baolin.wang, dev.jain, hughd, ioworker0, kirill,
	linux-kernel, linux-mm, mpenttil, npache, ryan.roberts, ziy

On Tue, Oct 07, 2025 at 06:25:13PM +0800, Lance Yang wrote:
>
>
>On 2025/10/6 22:18, David Hildenbrand wrote:
>> On 05.10.25 04:12, Lance Yang wrote:
[...]
>> 
>> I was looking into some possible races with uffd-wp being set before we
>> enter do_swap_page(), but I think it might be okay (although very
>> confusing).
>
>How about the version below?
>
>[...]
>
>I also think it makes sense to fold the change that adds the
>non_swap_entry() check in __collapse_huge_page_swapin() into
>this patch, rather than creating a new patch just for that :)
>

Agree.

>[...]

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries
  2025-10-07 10:25           ` Lance Yang
  2025-10-08  1:37             ` Wei Yang
@ 2025-10-08  9:00             ` David Hildenbrand
  1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2025-10-08  9:00 UTC (permalink / raw)
  To: Lance Yang, Wei Yang
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, baohua, baolin.wang,
	dev.jain, hughd, ioworker0, kirill, linux-kernel, linux-mm,
	mpenttil, npache, ryan.roberts, ziy

> Second, as Wei pointed out[3], we may have a chance to get a non-swap
> entry, since we will drop and re-acquire the mmap lock before
> __collapse_huge_page_swapin(). To handle this, we also add a
> non_swap_entry() check there.
> 
> Note that we can unlock later what we really need, and not account it
> towards max_swap_ptes.
> 
> [1] https://lore.kernel.org/linux-mm/09eaca7b-9988-41c7-8d6e-4802055b3f1e@redhat.com
> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
> [3] https://lore.kernel.org/linux-mm/20251005010511.ysek2nqojebqngf3@master
> ```
> 

I replied to the new version.

> I also think it makes sense to fold the change that adds the
> non_swap_entry() check in __collapse_huge_page_swapin() into
> this patch, rather than creating a new patch just for that :)
> 
> Hmmm... one thing I'm not sure about: regarding the uffd-wp
> race you mentioned, is the pte_swp_uffd_wp() check needed
> after non_swap_entry()? It seems like it might not be ...
> 

If we check non_swap_entry(), there is no need to check for other 
markers, agreed.
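
To spell that out (a minimal sketch, assuming the current
<linux/swapops.h> encoding): a uffd-wp PTE marker is itself a non-swap
entry, so the non_swap_entry() check already filters it out before
do_swap_page() is reached:

```
/*
 * Sketch only: a PTE marker entry (including the uffd-wp marker) is
 * encoded with a special swap type, so non_swap_entry() is true for it.
 */
swp_entry_t marker = make_pte_marker_entry(PTE_MARKER_UFFD_WP);

WARN_ON(!is_pte_marker_entry(marker));
WARN_ON(!non_swap_entry(marker));
```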

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-10-08  9:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-01  3:22 [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries Lance Yang
2025-10-01  8:31 ` David Hildenbrand
2025-10-01  9:38   ` Lance Yang
2025-10-01  8:54 ` Wei Yang
2025-10-01  9:15   ` David Hildenbrand
2025-10-01 10:05   ` Lance Yang
2025-10-01 13:52     ` Wei Yang
2025-10-05  1:05     ` Wei Yang
2025-10-05  2:12       ` Lance Yang
2025-10-06 14:18         ` David Hildenbrand
2025-10-06 15:02           ` Lance Yang
2025-10-07 10:25           ` Lance Yang
2025-10-08  1:37             ` Wei Yang
2025-10-08  9:00             ` David Hildenbrand
2025-10-01 10:20 ` Dev Jain
2025-10-01 10:48   ` Lance Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox