[PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
@ 2025-11-20  6:50 Shivank Garg
  2025-11-20  6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
  2025-11-20  6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
  0 siblings, 2 replies; 13+ messages in thread
From: Shivank Garg @ 2025-11-20  6:50 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel, Shivank Garg

MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages
are dirty. This may affect real scenarios: package/container updates,
executing binaries immediately after writing them, etc.

The issue is that collapse_file() triggers async writeback and returns
SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But
MADV_COLLAPSE is synchronous and userspace expects immediate success or a
clear retry signal.

Reproduction:
 - Copy 2MB-aligned executable to freshly mounted XFS/ext4
 - Call MADV_COLLAPSE on .text section
 - First call fails with -EINVAL (text pages dirty from copy)
 - Second call succeeds (async writeback completed)

Issue Report:
https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com

v2:
- Move writeback to madvise_collapse() (better abstraction, proper
  mmap_lock handling and does VMA revalidation after I/O) (Lorenzo)
- Rename to SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use
  for all dirty/writeback folio cases that previously returned incorrect
  results (Dev) 

v1: https://lore.kernel.org/all/20251110113254.77822-1-shivankg@amd.com

Shivank Garg (2):
  mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
  mm/khugepaged: map dirty/writeback pages failures to EAGAIN

 include/trace/events/huge_memory.h |  3 ++-
 mm/khugepaged.c                    | 34 +++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 4 deletions(-)


base-commit: 4a3f8fc3adb7046e44bd1feb2f5c5fe95296894f
-- 
2.43.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
  2025-11-20  6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
@ 2025-11-20  6:50 ` Shivank Garg
  2025-11-20 13:01   ` Lance Yang
  2025-11-20 13:35   ` David Hildenbrand (Red Hat)
  2025-11-20  6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
  1 sibling, 2 replies; 13+ messages in thread
From: Shivank Garg @ 2025-11-20  6:50 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel, Shivank Garg, Branden Moore

When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
text sections), the pages may still be dirty from recent writes and
cause collapse to fail with -EINVAL. This is particularly problematic
for freshly copied executables on filesystems, where page cache folios
remain dirty until background writeback completes.

The current code in collapse_file() triggers async writeback via
filemap_flush() and expects khugepaged to revisit the page later.
However, MADV_COLLAPSE is a synchronous operation where userspace
expects immediate results.

Perform synchronous writeback in madvise_collapse() before attempting
collapse to avoid failing on first attempt.

Reported-by: Branden Moore <Branden.Moore@amd.com>
Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/khugepaged.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..066a332c76ad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -22,6 +22,7 @@
 #include <linux/dax.h>
 #include <linux/ksm.h>
 #include <linux/pgalloc.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
 	hend = end & HPAGE_PMD_MASK;
 
+	/*
+	 * For file-backed VMAs, perform synchronous writeback to ensure
+	 * dirty folios are flushed before attempting collapse. This avoids
+	 * failing on the first attempt when freshly-written executable text
+	 * is still dirty in the page cache.
+	 */
+	if (!vma_is_anonymous(vma) && vma->vm_file) {
+		struct address_space *mapping = vma->vm_file->f_mapping;
+
+		if (mapping_can_writeback(mapping)) {
+			pgoff_t pgoff_start = linear_page_index(vma, hstart);
+			pgoff_t pgoff_end = linear_page_index(vma, hend);
+			loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
+			loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
+
+			mmap_read_unlock(mm);
+			mmap_locked = false;
+
+			if (filemap_write_and_wait_range(mapping, lstart, lend)) {
+				last_fail = SCAN_FAIL;
+				goto out_maybelock;
+			}
+		}
+	}
+
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		int result = SCAN_FAIL;
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20  6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
  2025-11-20  6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
@ 2025-11-20  6:50 ` Shivank Garg
  2025-11-20  8:03   ` Dev Jain
  1 sibling, 1 reply; 13+ messages in thread
From: Shivank Garg @ 2025-11-20  6:50 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel, Shivank Garg

When collapse_file encounters dirty or writeback pages in file-backed
mappings, it currently SCAN_FAIL which maps to -EINVAL. This is
misleading as EINVAL suggests invalid arguments, whereas dirty/writeback
pages represent transient conditions that may resolve on retry.

Introduce SCAN_PAGE_NOT_CLEAN to cover both dirty and writeback states,
mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with
a clear signal that retry may succeed after writeback completes, making
-EAGAIN semantically correct. For khugepaged, this is harmless as it
will naturally revisit the range during periodic scans after async
writeback completes.

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/trace/events/huge_memory.h | 3 ++-
 mm/khugepaged.c                    | 8 +++++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4cde53b45a85..1caf24b951e1 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -37,7 +37,8 @@
 	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
 	EM( SCAN_STORE_FAILED,		"store_failed")			\
 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
-	EMe(SCAN_PAGE_FILLED,		"page_filled")
+	EM( SCAN_PAGE_FILLED,		"page_filled")			\
+	EMe(SCAN_PAGE_NOT_CLEAN,	"page_not_clean")
 
 #undef EM
 #undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 066a332c76ad..282b413d17e8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -59,6 +59,7 @@ enum scan_result {
 	SCAN_STORE_FAILED,
 	SCAN_COPY_MC,
 	SCAN_PAGE_FILLED,
+	SCAN_PAGE_NOT_CLEAN,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1968,11 +1969,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 				 */
 				xas_unlock_irq(&xas);
 				filemap_flush(mapping);
-				result = SCAN_FAIL;
+				result = SCAN_PAGE_NOT_CLEAN;
 				goto xa_unlocked;
 			} else if (folio_test_writeback(folio)) {
 				xas_unlock_irq(&xas);
-				result = SCAN_FAIL;
+				result = SCAN_PAGE_NOT_CLEAN;
 				goto xa_unlocked;
 			} else if (folio_trylock(folio)) {
 				folio_get(folio);
@@ -2019,7 +2020,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 			 * folio is dirty because it hasn't been flushed
 			 * since first write.
 			 */
-			result = SCAN_FAIL;
+			result = SCAN_PAGE_NOT_CLEAN;
 			goto out_unlock;
 		}
 
@@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r)
 	case SCAN_PAGE_LRU:
 	case SCAN_DEL_PAGE_LRU:
 	case SCAN_PAGE_FILLED:
+	case SCAN_PAGE_NOT_CLEAN:
 		return -EAGAIN;
 	/*
 	 * Other: Trying again likely not to succeed / error intrinsic to
-- 
2.43.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20  6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
@ 2025-11-20  8:03   ` Dev Jain
  2025-11-20  8:17     ` Garg, Shivank
  0 siblings, 1 reply; 13+ messages in thread
From: Dev Jain @ 2025-11-20  8:03 UTC (permalink / raw)
  To: Shivank Garg, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	linux-trace-kernel


On 20/11/25 12:20 pm, Shivank Garg wrote:
> When collapse_file encounters dirty or writeback pages in file-backed
> mappings, it currently SCAN_FAIL which maps to -EINVAL. This is
> misleading as EINVAL suggests invalid arguments, whereas dirty/writeback
> pages represent transient conditions that may resolve on retry.
>
> Introduce SCAN_PAGE_NOT_CLEAN to cover both dirty and writeback states,
> mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with
> a clear signal that retry may succeed after writeback completes, making
> -EAGAIN semantically correct. For khugepaged, this is harmless as it
> will naturally revisit the range during periodic scans after async
> writeback completes.
>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>   include/trace/events/huge_memory.h | 3 ++-
>   mm/khugepaged.c                    | 8 +++++---
>   2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4cde53b45a85..1caf24b951e1 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -37,7 +37,8 @@
>   	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
>   	EM( SCAN_STORE_FAILED,		"store_failed")			\
>   	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
> -	EMe(SCAN_PAGE_FILLED,		"page_filled")
> +	EM( SCAN_PAGE_FILLED,		"page_filled")			\
> +	EMe(SCAN_PAGE_NOT_CLEAN,	"page_not_clean")
>   
>   #undef EM
>   #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 066a332c76ad..282b413d17e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -59,6 +59,7 @@ enum scan_result {
>   	SCAN_STORE_FAILED,
>   	SCAN_COPY_MC,
>   	SCAN_PAGE_FILLED,
> +	SCAN_PAGE_NOT_CLEAN,
>   };
>   
>   #define CREATE_TRACE_POINTS
> @@ -1968,11 +1969,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>   				 */
>   				xas_unlock_irq(&xas);
>   				filemap_flush(mapping);
> -				result = SCAN_FAIL;
> +				result = SCAN_PAGE_NOT_CLEAN;
>   				goto xa_unlocked;
>   			} else if (folio_test_writeback(folio)) {
>   				xas_unlock_irq(&xas);
> -				result = SCAN_FAIL;
> +				result = SCAN_PAGE_NOT_CLEAN;
>   				goto xa_unlocked;
>   			} else if (folio_trylock(folio)) {
>   				folio_get(folio);
> @@ -2019,7 +2020,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>   			 * folio is dirty because it hasn't been flushed
>   			 * since first write.
>   			 */
> -			result = SCAN_FAIL;
> +			result = SCAN_PAGE_NOT_CLEAN;
>   			goto out_unlock;
>   		}
>   
> @@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r)
>   	case SCAN_PAGE_LRU:
>   	case SCAN_DEL_PAGE_LRU:
>   	case SCAN_PAGE_FILLED:
> +	case SCAN_PAGE_NOT_CLEAN:
>   		return -EAGAIN;
>   	/*
>   	 * Other: Trying again likely not to succeed / error intrinsic to

SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.

Reviewed-by: Dev Jain <dev.jain@arm.com>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20  8:03   ` Dev Jain
@ 2025-11-20  8:17     ` Garg, Shivank
  2025-11-20  9:55       ` Dev Jain
  2025-11-20 12:24       ` Lance Yang
  0 siblings, 2 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-20  8:17 UTC (permalink / raw)
  To: Dev Jain, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	linux-trace-kernel



On 11/20/2025 1:33 PM, Dev Jain wrote:
> 
> On 20/11/25 12:20 pm, Shivank Garg wrote:

> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
> 
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> 
Thanks for the review.

I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:

    Dirty: Memory that is waiting to be written back to disk
Writeback: Memory that is actively being written back to disk

[1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt

IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
for pages in the writeback state.

I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.

SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable
state suitable for collapse.

[1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt

Thanks,
Shivank


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20  8:17     ` Garg, Shivank
@ 2025-11-20  9:55       ` Dev Jain
  2025-11-20 12:24       ` Lance Yang
  1 sibling, 0 replies; 13+ messages in thread
From: Dev Jain @ 2025-11-20  9:55 UTC (permalink / raw)
  To: Garg, Shivank, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	linux-trace-kernel


On 20/11/25 1:47 pm, Garg, Shivank wrote:
>
> On 11/20/2025 1:33 PM, Dev Jain wrote:
>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>
> Thanks for the review.
>
> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>
>      Dirty: Memory that is waiting to be written back to disk
> Writeback: Memory that is actively being written back to disk
>
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>
> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
> for pages in the writeback state.
>
> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
>
> SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable
> state suitable for collapse.
>
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt

Alright, makes sense.

>
> Thanks,
> Shivank


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20  8:17     ` Garg, Shivank
  2025-11-20  9:55       ` Dev Jain
@ 2025-11-20 12:24       ` Lance Yang
  2025-11-20 13:29         ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 13+ messages in thread
From: Lance Yang @ 2025-11-20 12:24 UTC (permalink / raw)
  To: Garg, Shivank, Dev Jain
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, David Hildenbrand,
	Nico Pache, Ryan Roberts, Barry Song, Andrew Morton,
	Steven Rostedt, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	linux-trace-kernel



On 2025/11/20 16:17, Garg, Shivank wrote:
> 
> 
> On 11/20/2025 1:33 PM, Dev Jain wrote:
>>
>> On 20/11/25 12:20 pm, Shivank Garg wrote:
> 
>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>
> Thanks for the review.
> 
> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
> 
>      Dirty: Memory that is waiting to be written back to disk
> Writeback: Memory that is actively being written back to disk
> 
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
> 
> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
> for pages in the writeback state.
> 
> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.

Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK is too verbose, how about 
SCAN_PAGE_DIRTY_WB?

It keeps the specificity without the length, and is arguably more 
descriptive
than NOT_CLEAN ;)

That said, LGTM.

Reviewed-by: Lance Yang <lance.yang@linux.dev>

> 
> SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable
> state suitable for collapse.
> 
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
> 
> Thanks,
> Shivank



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
  2025-11-20  6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
@ 2025-11-20 13:01   ` Lance Yang
  2025-11-21  6:27     ` Garg, Shivank
  2025-11-20 13:35   ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 13+ messages in thread
From: Lance Yang @ 2025-11-20 13:01 UTC (permalink / raw)
  To: Shivank Garg
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Steven Rostedt, David Hildenbrand,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	Andrew Morton, linux-kernel, linux-trace-kernel, Branden Moore,
	Lorenzo Stoakes



On 2025/11/20 14:50, Shivank Garg wrote:
> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> text sections), the pages may still be dirty from recent writes and
> cause collapse to fail with -EINVAL. This is particularly problematic
> for freshly copied executables on filesystems, where page cache folios
> remain dirty until background writeback completes.
> 
> The current code in collapse_file() triggers async writeback via
> filemap_flush() and expects khugepaged to revisit the page later.
> However, MADV_COLLAPSE is a synchronous operation where userspace
> expects immediate results.
> 
> Perform synchronous writeback in madvise_collapse() before attempting
> collapse to avoid failing on first attempt.

Thanks!

> 
> Reported-by: Branden Moore <Branden.Moore@amd.com>
> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>   mm/khugepaged.c | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..066a332c76ad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -22,6 +22,7 @@
>   #include <linux/dax.h>
>   #include <linux/ksm.h>
>   #include <linux/pgalloc.h>
> +#include <linux/backing-dev.h>
>   
>   #include <asm/tlb.h>
>   #include "internal.h"
> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>   	hend = end & HPAGE_PMD_MASK;
>   
> +	/*
> +	 * For file-backed VMAs, perform synchronous writeback to ensure
> +	 * dirty folios are flushed before attempting collapse. This avoids
> +	 * failing on the first attempt when freshly-written executable text
> +	 * is still dirty in the page cache.
> +	 */
> +	if (!vma_is_anonymous(vma) && vma->vm_file) {
> +		struct address_space *mapping = vma->vm_file->f_mapping;
> +
> +		if (mapping_can_writeback(mapping)) {
> +			pgoff_t pgoff_start = linear_page_index(vma, hstart);
> +			pgoff_t pgoff_end = linear_page_index(vma, hend);
> +			loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
> +			loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;

It looks like we need to hold a reference to the file here before
dropping the mmap lock :)

			file = get_file(vma->vm_file);

Without it, the vma could be destroyed by a concurrent munmap() while
we are waiting in filemap_write_and_wait_range(), leading to a UAF
on mapping, IIUC ...

> +
> +			mmap_read_unlock(mm);
> +			mmap_locked = false;
> +
> +			if (filemap_write_and_wait_range(mapping, lstart, lend)) {

And drop the reference :)

				fput(file);


> +				last_fail = SCAN_FAIL;
> +				goto out_maybelock;
> +			}

Same here :)

			fput(file);


> +		}
> +	}
> +
>   	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>   		int result = SCAN_FAIL;
>   

Cheers,
Lance


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20 12:24       ` Lance Yang
@ 2025-11-20 13:29         ` David Hildenbrand (Red Hat)
  2025-11-21  6:15           ` Garg, Shivank
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-20 13:29 UTC (permalink / raw)
  To: Lance Yang, Garg, Shivank, Dev Jain
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel

On 11/20/25 13:24, Lance Yang wrote:
> 
> 
> On 2025/11/20 16:17, Garg, Shivank wrote:
>>
>>
>> On 11/20/2025 1:33 PM, Dev Jain wrote:
>>>
>>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>>
>>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>>
>>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>>
>> Thanks for the review.
>>
>> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>>
>>       Dirty: Memory that is waiting to be written back to disk
>> Writeback: Memory that is actively being written back to disk
>>
>> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>>
>> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
>> for pages in the writeback state.
>>
>> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
> 
> Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK 

I would prefer that here.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
  2025-11-20  6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
  2025-11-20 13:01   ` Lance Yang
@ 2025-11-20 13:35   ` David Hildenbrand (Red Hat)
  2025-11-21  6:27     ` Garg, Shivank
  1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-20 13:35 UTC (permalink / raw)
  To: Shivank Garg, Andrew Morton, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel, Branden Moore

On 11/20/25 07:50, Shivank Garg wrote:
> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> text sections), the pages may still be dirty from recent writes and
> cause collapse to fail with -EINVAL. This is particularly problematic
> for freshly copied executables on filesystems, where page cache folios
> remain dirty until background writeback completes.
> 
> The current code in collapse_file() triggers async writeback via
> filemap_flush() and expects khugepaged to revisit the page later.
> However, MADV_COLLAPSE is a synchronous operation where userspace
> expects immediate results.
> 
> Perform synchronous writeback in madvise_collapse() before attempting
> collapse to avoid failing on first attempt.
> 
> Reported-by: Branden Moore <Branden.Moore@amd.com>
> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>   mm/khugepaged.c | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..066a332c76ad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -22,6 +22,7 @@
>   #include <linux/dax.h>
>   #include <linux/ksm.h>
>   #include <linux/pgalloc.h>
> +#include <linux/backing-dev.h>
>   
>   #include <asm/tlb.h>
>   #include "internal.h"
> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>   	hend = end & HPAGE_PMD_MASK;
>   
> +	/*
> +	 * For file-backed VMAs, perform synchronous writeback to ensure
> +	 * dirty folios are flushed before attempting collapse. This avoids
> +	 * failing on the first attempt when freshly-written executable text
> +	 * is still dirty in the page cache.
> +	 */
> +	if (!vma_is_anonymous(vma) && vma->vm_file) {
> +		struct address_space *mapping = vma->vm_file->f_mapping;
> +
> +		if (mapping_can_writeback(mapping)) {
> +			pgoff_t pgoff_start = linear_page_index(vma, hstart);
> +			pgoff_t pgoff_end = linear_page_index(vma, hend);
> +			loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
> +			loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
> +

Hm, so we always do that, without any indication that there actually is 
something dirty there.

Internally filemap_write_and_wait_range() uses something called 
mapping_needs_writeback(), but it also applies to the complete file, not 
a range.

Wouldn't it be better do do that only if we detect that there is 
actually a dirty folio in the range?

That is, if we find any dirty folio in hpage_collapse_scan_file() and we 
are in madvise, do that dance here and retry?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2025-11-20 13:29         ` David Hildenbrand (Red Hat)
@ 2025-11-21  6:15           ` Garg, Shivank
  0 siblings, 0 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-21  6:15 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Lance Yang, Dev Jain
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel



On 11/20/2025 6:59 PM, David Hildenbrand (Red Hat) wrote:
> On 11/20/25 13:24, Lance Yang wrote:
>>
>>
>> On 2025/11/20 16:17, Garg, Shivank wrote:
>>>
>>>
>>> On 11/20/2025 1:33 PM, Dev Jain wrote:
>>>>
>>>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>>>
>>>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>>>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>>>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>>>
>>>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>>>
>>> Thanks for the review.
>>>
>>> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>>>
>>>       Dirty: Memory that is waiting to be written back to disk
>>> Writeback: Memory that is actively being written back to disk
>>>
>>> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>>>
>>> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
>>> for pages in the writeback state.
>>>
>>> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
>>
>> Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK 
> 
> I would prefer that here.
> 

I agree on this. 
If the consensus is SCAN_PAGE_DIRTY_OR_WRITEBACK, I'll use it in v3.

Thanks,
Shivank


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
  2025-11-20 13:01   ` Lance Yang
@ 2025-11-21  6:27     ` Garg, Shivank
  0 siblings, 0 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-21  6:27 UTC (permalink / raw)
  To: Lance Yang
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Steven Rostedt, David Hildenbrand,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	Andrew Morton, linux-kernel, linux-trace-kernel, Branden Moore,
	Lorenzo Stoakes



On 11/20/2025 6:31 PM, Lance Yang wrote:
> 
> 
> On 2025/11/20 14:50, Shivank Garg wrote:
>> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
>> text sections), the pages may still be dirty from recent writes and
>> cause collapse to fail with -EINVAL. This is particularly problematic
>> for freshly copied executables on filesystems, where page cache folios
>> remain dirty until background writeback completes.
>>
>> The current code in collapse_file() triggers async writeback via
>> filemap_flush() and expects khugepaged to revisit the page later.
>> However, MADV_COLLAPSE is a synchronous operation where userspace
>> expects immediate results.
>>
>> Perform synchronous writeback in madvise_collapse() before attempting
>> collapse to avoid failing on first attempt.
> 
> Thanks!
> 
>>
>> Reported-by: Branden Moore <Branden.Moore@amd.com>
>> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
>> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
>> ---
>>   mm/khugepaged.c | 26 ++++++++++++++++++++++++++
>>   1 file changed, 26 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..066a332c76ad 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -22,6 +22,7 @@
>>   #include <linux/dax.h>
>>   #include <linux/ksm.h>
>>   #include <linux/pgalloc.h>
>> +#include <linux/backing-dev.h>
>>     #include <asm/tlb.h>
>>   #include "internal.h"
>> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>>       hend = end & HPAGE_PMD_MASK;
>>   +    /*
>> +     * For file-backed VMAs, perform synchronous writeback to ensure
>> +     * dirty folios are flushed before attempting collapse. This avoids
>> +     * failing on the first attempt when freshly-written executable text
>> +     * is still dirty in the page cache.
>> +     */
>> +    if (!vma_is_anonymous(vma) && vma->vm_file) {
>> +        struct address_space *mapping = vma->vm_file->f_mapping;
>> +
>> +        if (mapping_can_writeback(mapping)) {
>> +            pgoff_t pgoff_start = linear_page_index(vma, hstart);
>> +            pgoff_t pgoff_end = linear_page_index(vma, hend);
>> +            loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
>> +            loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
> 
> It looks like we need to hold a reference to the file here before
> dropping the mmap lock :)
> 
>             file = get_file(vma->vm_file);
> 
> Without it, the vma could be destroyed by a concurrent munmap() while
> we are waiting in filemap_write_and_wait_range(), leading to a UAF
> on mapping, IIUC ...

Excellent catch!
Thanks for saving me from this nasty bug. I'll be more careful on file ref
handling in next version.

Best Regards,
Shivank
> 
>> +
>> +            mmap_read_unlock(mm);
>> +            mmap_locked = false;
>> +
>> +            if (filemap_write_and_wait_range(mapping, lstart, lend)) {
> 
> And drop the reference :)
> 
>                 fput(file);
> 
> 
>> +                last_fail = SCAN_FAIL;
>> +                goto out_maybelock;
>> +            }
> 
> Same here :)
> 
>             fput(file);
> 
> 
>> +        }
>> +    }
>> +
>>       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>>           int result = SCAN_FAIL;
>>   
> 
> Cheers,
> Lance



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
  2025-11-20 13:35   ` David Hildenbrand (Red Hat)
@ 2025-11-21  6:27     ` Garg, Shivank
  0 siblings, 0 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-21  6:27 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Andrew Morton, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
	linux-kernel, linux-trace-kernel, Branden Moore



On 11/20/2025 7:05 PM, David Hildenbrand (Red Hat) wrote:
> On 11/20/25 07:50, Shivank Garg wrote:
>> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
>> text sections), the pages may still be dirty from recent writes and
>> cause collapse to fail with -EINVAL. This is particularly problematic
>> for freshly copied executables on filesystems, where page cache folios
>> remain dirty until background writeback completes.
>>
>> The current code in collapse_file() triggers async writeback via
>> filemap_flush() and expects khugepaged to revisit the page later.
>> However, MADV_COLLAPSE is a synchronous operation where userspace
>> expects immediate results.
>>
>> Perform synchronous writeback in madvise_collapse() before attempting
>> collapse to avoid failing on first attempt.
>>
>> Reported-by: Branden Moore <Branden.Moore@amd.com>
>> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
>> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
>> ---
>>   mm/khugepaged.c | 26 ++++++++++++++++++++++++++
>>   1 file changed, 26 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..066a332c76ad 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -22,6 +22,7 @@
>>   #include <linux/dax.h>
>>   #include <linux/ksm.h>
>>   #include <linux/pgalloc.h>
>> +#include <linux/backing-dev.h>
>>     #include <asm/tlb.h>
>>   #include "internal.h"
>> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>>       hend = end & HPAGE_PMD_MASK;
>>   +    /*
>> +     * For file-backed VMAs, perform synchronous writeback to ensure
>> +     * dirty folios are flushed before attempting collapse. This avoids
>> +     * failing on the first attempt when freshly-written executable text
>> +     * is still dirty in the page cache.
>> +     */
>> +    if (!vma_is_anonymous(vma) && vma->vm_file) {
>> +        struct address_space *mapping = vma->vm_file->f_mapping;
>> +
>> +        if (mapping_can_writeback(mapping)) {
>> +            pgoff_t pgoff_start = linear_page_index(vma, hstart);
>> +            pgoff_t pgoff_end = linear_page_index(vma, hend);
>> +            loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
>> +            loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
>> +
> 
> Hm, so we always do that, without any indication that there actually is something dirty there.
> 
> Internally filemap_write_and_wait_range() uses something called mapping_needs_writeback(), but it also applies to the complete file, not a range.
> 
> Wouldn't it be better do do that only if we detect that there is actually a dirty folio in the range?
> 
> That is, if we find any dirty folio in hpage_collapse_scan_file() and we are in madvise, do that dance here and retry?
> 

Good point! This makes sense to me.
I'll send V3 with this approach.

Thanks,
Shivank


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-11-21  6:28 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-20  6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
2025-11-20  6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
2025-11-20 13:01   ` Lance Yang
2025-11-21  6:27     ` Garg, Shivank
2025-11-20 13:35   ` David Hildenbrand (Red Hat)
2025-11-21  6:27     ` Garg, Shivank
2025-11-20  6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
2025-11-20  8:03   ` Dev Jain
2025-11-20  8:17     ` Garg, Shivank
2025-11-20  9:55       ` Dev Jain
2025-11-20 12:24       ` Lance Yang
2025-11-20 13:29         ` David Hildenbrand (Red Hat)
2025-11-21  6:15           ` Garg, Shivank

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox