[PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
@ 2026-01-18 19:09 Shivank Garg
  2026-01-18 19:09 ` [PATCH V5 1/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Shivank Garg @ 2026-01-18 19:09 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Masami Hiramatsu,
	Steven Rostedt, linux-trace-kernel, Mathieu Desnoyers,
	Zach O'Keefe, linux-mm, linux-kernel, Stephen Rothwell,
	shivankg

MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages
are dirty. This affects scenarios like package/container updates or
executing binaries immediately after writing them, etc.

The issue is that collapse_file() triggers async writeback and returns
SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But
MADV_COLLAPSE is synchronous and userspace expects immediate success or
a clear retry signal.

Reproduction:
 - Compile or copy 2MB-aligned executable to XFS/ext4 FS
 - Call MADV_COLLAPSE on .text section
 - First call fails with -EINVAL (text pages dirty from copy)
 - Second call succeeds (async writeback completed)

Issue Report:
https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com


Hi Andrew,

This V5 series incorporates David's feedback for simplifying the retry
logic. To apply on mm-new, please drop:
- 20251215084615.5283-3-shivankg@amd.com:
  [PATCH V4 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
- 20251224111351.41042-4-shivankg@amd.com:
  [PATCH V2 0/5] mm/khugepaged: cleanups and scan limit fix
  (merge conflicts with this series; V3 with review fixes posting soon)

Thank you :)


Changelog:
V5:
- In patch 2/2, Simplify dirty writeback retry logic (David)

V4:
- https://lore.kernel.org/all/20251215084615.5283-3-shivankg@amd.com
- Rebase on mm-new
- Fix spurious blank line (Lance)

V3:
- https://lore.kernel.org/all/20251201185604.210634-6-shivankg@amd.com
- Reordered patches: Enum definition comes first as the retry logic depends on it
- Renamed SCAN_PAGE_NOT_CLEAN to SCAN_PAGE_DIRTY_OR_WRITEBACK (Dev, Lance, David)
- Changed writeback logic: Only trigger synchronous writeback and retry if the
  initial collapse attempt failed specifically due to dirty/writeback pages,
  rather than blindly flushing all file-backed VMAs (David)
- Added proper file reference counting (get_file/fput) around the unlock window
  to prevent UAF (Lance)

V2:
- https://lore.kernel.org/all/20251120065043.41738-6-shivankg@amd.com
- Move writeback to madvise_collapse() (better abstraction, proper
  mmap_lock handling and does VMA revalidation after I/O) (Lorenzo)
- Rename to SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use
  for all dirty/writeback folio cases that previously returned incorrect
  results (Dev) 

V1: https://lore.kernel.org/all/20251110113254.77822-1-shivankg@amd.com

Thanks,

Shivank Garg (2):
  mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  mm/khugepaged: retry with sync writeback for MADV_COLLAPSE

 include/trace/events/huge_memory.h |  3 ++-
 mm/khugepaged.c                    | 23 ++++++++++++++++++++---
 2 files changed, 22 insertions(+), 4 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH V5 1/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  2026-01-18 19:09 [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
@ 2026-01-18 19:09 ` Shivank Garg
  2026-01-18 19:09 ` [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE Shivank Garg
  2026-01-18 20:22 ` [PATCH V5 0/2] mm/khugepaged: fix dirty page handling " Andrew Morton
  2 siblings, 0 replies; 10+ messages in thread
From: Shivank Garg @ 2026-01-18 19:09 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Masami Hiramatsu,
	Steven Rostedt, linux-trace-kernel, Mathieu Desnoyers,
	Zach O'Keefe, linux-mm, linux-kernel, Stephen Rothwell,
	shivankg, Branden Moore, wang lian

When collapse_file encounters dirty or writeback pages in file-backed
mappings, it currently returns SCAN_FAIL which maps to -EINVAL.  This is
misleading as EINVAL suggests invalid arguments, whereas dirty/writeback
pages represent transient conditions that may resolve on retry.

Introduce SCAN_PAGE_DIRTY_OR_WRITEBACK to cover both dirty and writeback
states, mapping it to -EAGAIN.  For MADV_COLLAPSE, this provides userspace
with a clear signal that retry may succeed after writeback completes.  For
khugepaged, this is harmless as it will naturally revisit the range during
periodic scans after async writeback completes.

Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Reported-by: Branden Moore <Branden.Moore@amd.com>
Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: wang lian <lianux.mm@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/trace/events/huge_memory.h | 3 ++-
 mm/khugepaged.c                    | 8 +++++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4cde53b45a85..4e41bff31888 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -37,7 +37,8 @@
 	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
 	EM( SCAN_STORE_FAILED,		"store_failed")			\
 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
-	EMe(SCAN_PAGE_FILLED,		"page_filled")
+	EM( SCAN_PAGE_FILLED,		"page_filled")			\
+	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
 
 #undef EM
 #undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..219dfa2e523c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -58,6 +58,7 @@ enum scan_result {
 	SCAN_STORE_FAILED,
 	SCAN_COPY_MC,
 	SCAN_PAGE_FILLED,
+	SCAN_PAGE_DIRTY_OR_WRITEBACK,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1967,11 +1968,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 				 */
 				xas_unlock_irq(&xas);
 				filemap_flush(mapping);
-				result = SCAN_FAIL;
+				result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
 				goto xa_unlocked;
 			} else if (folio_test_writeback(folio)) {
 				xas_unlock_irq(&xas);
-				result = SCAN_FAIL;
+				result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
 				goto xa_unlocked;
 			} else if (folio_trylock(folio)) {
 				folio_get(folio);
@@ -2018,7 +2019,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 			 * folio is dirty because it hasn't been flushed
 			 * since first write.
 			 */
-			result = SCAN_FAIL;
+			result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
 			goto out_unlock;
 		}
 
@@ -2747,6 +2748,7 @@ static int madvise_collapse_errno(enum scan_result r)
 	case SCAN_PAGE_LRU:
 	case SCAN_DEL_PAGE_LRU:
 	case SCAN_PAGE_FILLED:
+	case SCAN_PAGE_DIRTY_OR_WRITEBACK:
 		return -EAGAIN;
 	/*
 	 * Other: Trying again likely not to succeed / error intrinsic to
-- 
2.43.0



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
  2026-01-18 19:09 [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
  2026-01-18 19:09 ` [PATCH V5 1/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
@ 2026-01-18 19:09 ` Shivank Garg
  2026-01-19 10:08   ` David Hildenbrand (Red Hat)
  2026-01-22 11:07   ` Dev Jain
  2026-01-18 20:22 ` [PATCH V5 0/2] mm/khugepaged: fix dirty page handling " Andrew Morton
  2 siblings, 2 replies; 10+ messages in thread
From: Shivank Garg @ 2026-01-18 19:09 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Masami Hiramatsu,
	Steven Rostedt, linux-trace-kernel, Mathieu Desnoyers,
	Zach O'Keefe, linux-mm, linux-kernel, Stephen Rothwell,
	shivankg, Branden Moore

When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
text sections), the pages may still be dirty from recent writes.
collapse_file() will trigger async writeback and fail with
SCAN_PAGE_DIRTY_OR_WRITEBACK (-EAGAIN).

MADV_COLLAPSE is a synchronous operation where userspace expects
immediate results. If the collapse fails due to dirty pages, perform
synchronous writeback on the specific range and retry once.

This avoids spurious failures for freshly written executables while
avoiding unnecessary synchronous I/O for mappings that are already clean.

Reported-by: Branden Moore <Branden.Moore@amd.com>
Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Suggested-by: David Hildenbrand <david@kernel.org>
Tested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/khugepaged.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 219dfa2e523c..16582bdcb6ff 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -22,6 +22,7 @@
 #include <linux/dax.h>
 #include <linux/ksm.h>
 #include <linux/pgalloc.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -2788,7 +2789,9 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		int result = SCAN_FAIL;
+		bool triggered_wb = false;
 
+retry:
 		if (!mmap_locked) {
 			cond_resched();
 			mmap_read_lock(mm);
@@ -2809,8 +2812,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 			mmap_read_unlock(mm);
 			mmap_locked = false;
+			*lock_dropped = true;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
 							  cc);
+
+			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
+			    mapping_can_writeback(file->f_mapping)) {
+				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+				triggered_wb = true;
+				fput(file);
+				goto retry;
+			}
 			fput(file);
 		} else {
 			result = hpage_collapse_scan_pmd(mm, vma, addr,
-- 
2.43.0



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
  2026-01-18 19:09 [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
  2026-01-18 19:09 ` [PATCH V5 1/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
  2026-01-18 19:09 ` [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE Shivank Garg
@ 2026-01-18 20:22 ` Andrew Morton
  2026-01-19  5:07   ` Garg, Shivank
  2 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-01-18 20:22 UTC (permalink / raw)
  To: Shivank Garg
  Cc: David Hildenbrand, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Masami Hiramatsu, Steven Rostedt, linux-trace-kernel,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	Stephen Rothwell

On Sun, 18 Jan 2026 19:09:38 +0000 Shivank Garg <shivankg@amd.com> wrote:

> MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages
> are dirty. This affects scenarios like package/container updates or
> executing binaries immediately after writing them, etc.
> 
> The issue is that collapse_file() triggers async writeback and returns
> SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But
> MADV_COLLAPSE is synchronous and userspace expects immediate success or
> a clear retry signal.
> 
> Reproduction:
>  - Compile or copy 2MB-aligned executable to XFS/ext4 FS
>  - Call MADV_COLLAPSE on .text section
>  - First call fails with -EINVAL (text pages dirty from copy)
>  - Second call succeeds (async writeback completed)
> 
> Issue Report:
> https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com

Updated, thanks.

Please tolerate a little whining about the timeliess here.  We're at
-rc6, v4 was added to mm.git over a month ago, had quite a lot of
review, this is very close to being moved into the mm-stable branch and now
we get v5.  Argh.

> V5:
> - In patch 2/2, Simplify dirty writeback retry logic (David)

Are you sure this is the only change?  It looks like a lot for a
simplification and I'm wondering if we should retain the v4 series and
defer a simplification for separate consideration during the next
cycle.

Below is how this updated altered mm.git.  Could reviewers please check
this fairly soon?


--- a/mm/khugepaged.c~b
+++ a/mm/khugepaged.c
@@ -2788,11 +2788,11 @@ int madvise_collapse(struct vm_area_stru
 	hend = end & HPAGE_PMD_MASK;
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
-		bool retried = false;
 		int result = SCAN_FAIL;
+		bool triggered_wb = false;
 
-		if (!mmap_locked) {
 retry:
+		if (!mmap_locked) {
 			cond_resched();
 			mmap_read_lock(mm);
 			mmap_locked = true;
@@ -2812,52 +2812,27 @@ retry:
 
 			mmap_read_unlock(mm);
 			mmap_locked = false;
+			*lock_dropped = true;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
 							  cc);
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
-		if (!mmap_locked)
-			*lock_dropped = true;
-
-		/*
-		 * If the file-backed VMA has dirty pages, the scan triggers
-		 * async writeback and returns SCAN_PAGE_DIRTY_OR_WRITEBACK.
-		 * Since MADV_COLLAPSE is sync, we force sync writeback and
-		 * retry once.
-		 */
-		if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !retried) {
-			/*
-			 * File scan drops the lock. We must re-acquire it to
-			 * safely inspect the VMA and hold the file reference.
-			 */
-			if (!mmap_locked) {
-				cond_resched();
-				mmap_read_lock(mm);
-				mmap_locked = true;
-				result = hugepage_vma_revalidate(mm, addr, false, &vma, cc);
-				if (result != SCAN_SUCCEED)
-					goto handle_result;
-			}
 
-			if (!vma_is_anonymous(vma) && vma->vm_file &&
-			    mapping_can_writeback(vma->vm_file->f_mapping)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma, addr);
+			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
+			    mapping_can_writeback(file->f_mapping)) {
 				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
 				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
 
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*lock_dropped = true;
 				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+				triggered_wb = true;
 				fput(file);
-				retried = true;
 				goto retry;
 			}
+			fput(file);
+		} else {
+			result = hpage_collapse_scan_pmd(mm, vma, addr,
+							 &mmap_locked, cc);
 		}
+		if (!mmap_locked)
+			*lock_dropped = true;
 
 handle_result:
 		switch (result) {
_



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
  2026-01-18 20:22 ` [PATCH V5 0/2] mm/khugepaged: fix dirty page handling " Andrew Morton
@ 2026-01-19  5:07   ` Garg, Shivank
  2026-01-19  9:58     ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 10+ messages in thread
From: Garg, Shivank @ 2026-01-19  5:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Masami Hiramatsu, Steven Rostedt, linux-trace-kernel,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	Stephen Rothwell



On 1/19/2026 1:52 AM, Andrew Morton wrote:
> Please tolerate a little whining about the timeliess here.  We're at
> -rc6, v4 was added to mm.git over a month ago, had quite a lot of
> review, this is very close to being moved into the mm-stable branch and now
> we get v5.  Argh.
> 
I sincerely apologize for this.

I had this doubt on sending an incremental patch or V5:
https://lore.kernel.org/linux-mm/7a42515f-ae57-4f4d-831c-87689930a797@amd.com


>> V5:
>> - In patch 2/2, Simplify dirty writeback retry logic (David)
> Are you sure this is the only change?  It looks like a lot for a
> simplification and I'm wondering if we should retain the v4 series and
> defer a simplification for separate consideration during the next
> cycle.

Yes, patch 1/2 is unchanged and patch 2/2 is the only change. 
The diff looks larger due to code movement but logic is actually simpler now.

I completely understand if you prefer to keep V4 and defer this 
refactoring. I'm sorry for creating this late-cycle churn. Please let me 
know what you'd prefer and I'll follow your guidance.
Thank you for your patience.

Regards,
Shivank


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
  2026-01-19  5:07   ` Garg, Shivank
@ 2026-01-19  9:58     ` David Hildenbrand (Red Hat)
  2026-01-19 10:54       ` Lorenzo Stoakes
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-19  9:58 UTC (permalink / raw)
  To: Garg, Shivank, Andrew Morton
  Cc: Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R . Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang,
	Masami Hiramatsu, Steven Rostedt, linux-trace-kernel,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	Stephen Rothwell

On 1/19/26 06:07, Garg, Shivank wrote:
> 
> 
> On 1/19/2026 1:52 AM, Andrew Morton wrote:
>> Please tolerate a little whining about the timeliess here.  We're at
>> -rc6, v4 was added to mm.git over a month ago, had quite a lot of
>> review, this is very close to being moved into the mm-stable branch and now
>> we get v5.  Argh.
>>
> I sincerely apologize for this.
> 
> I had this doubt on sending an incremental patch or V5:
> https://lore.kernel.org/linux-mm/7a42515f-ae57-4f4d-831c-87689930a797@amd.com
> 
> 
>>> V5:
>>> - In patch 2/2, Simplify dirty writeback retry logic (David)
>> Are you sure this is the only change?  It looks like a lot for a
>> simplification and I'm wondering if we should retain the v4 series and
>> defer a simplification for separate consideration during the next
>> cycle.
> 
> Yes, patch 1/2 is unchanged and patch 2/2 is the only change.
> The diff looks larger due to code movement but logic is actually simpler now.
> 
> I completely understand if you prefer to keep V4 and defer this
> refactoring. I'm sorry for creating this late-cycle churn. Please let me
> know what you'd prefer and I'll follow your guidance.

There is no reason to rush any of this. :)

Review was not over and due to multiple factors my review on v4 came in 
later than I wanted.

Andrew, if you prefer, I can start sending as a reply to every patch set 
that I want to review so you are aware that it is on my review backlog.

Unfortunately we can't have nice things (sub-maintainer acks).

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
  2026-01-18 19:09 ` [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE Shivank Garg
@ 2026-01-19 10:08   ` David Hildenbrand (Red Hat)
  2026-01-22 11:07   ` Dev Jain
  1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-19 10:08 UTC (permalink / raw)
  To: Shivank Garg, Andrew Morton, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Masami Hiramatsu,
	Steven Rostedt, linux-trace-kernel, Mathieu Desnoyers,
	Zach O'Keefe, linux-mm, linux-kernel, Stephen Rothwell,
	Branden Moore

On 1/18/26 20:09, Shivank Garg wrote:
> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> text sections), the pages may still be dirty from recent writes.
> collapse_file() will trigger async writeback and fail with
> SCAN_PAGE_DIRTY_OR_WRITEBACK (-EAGAIN).
> 
> MADV_COLLAPSE is a synchronous operation where userspace expects
> immediate results. If the collapse fails due to dirty pages, perform
> synchronous writeback on the specific range and retry once.
> 
> This avoids spurious failures for freshly written executables while
> avoiding unnecessary synchronous I/O for mappings that are already clean.
> 
> Reported-by: Branden Moore <Branden.Moore@amd.com>
> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Suggested-by: David Hildenbrand <david@kernel.org>
> Tested-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---

Looks good to me now, thanks Shivank!

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
  2026-01-19  9:58     ` David Hildenbrand (Red Hat)
@ 2026-01-19 10:54       ` Lorenzo Stoakes
  0 siblings, 0 replies; 10+ messages in thread
From: Lorenzo Stoakes @ 2026-01-19 10:54 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Garg, Shivank, Andrew Morton, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Masami Hiramatsu, Steven Rostedt, linux-trace-kernel,
	Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
	Stephen Rothwell

On Mon, Jan 19, 2026 at 10:58:39AM +0100, David Hildenbrand (Red Hat) wrote:
> On 1/19/26 06:07, Garg, Shivank wrote:
> >
> >
> > On 1/19/2026 1:52 AM, Andrew Morton wrote:
> > > Please tolerate a little whining about the timeliess here.  We're at
> > > -rc6, v4 was added to mm.git over a month ago, had quite a lot of
> > > review, this is very close to being moved into the mm-stable branch and now
> > > we get v5.  Argh.

As co-maintainer of THP:

Can I explicitly request that you not merge anything in THP without EXPLICIT
sub-M sign-off, i.e. either me or David.

Obviously I have publicly and privately request that we do this for all of mm,
but in THP especially, we CANNOT accept series landing without this.

THP workload is one of the worst in mm and is completely overwhelming.

Again - David is working whilst being on holiday to handle the load because by
default we auto-merge.

And here you are complaining that you can't auto-merge (...!), it's completely
unworkable.

People can just _wait_ for things to land sometimes. It's OK. It's part of why
the whole rc- approach was added - I think Greg KH explicitly wrote about this
elsewhere - people can just wait for the next cycle and not be too disappointed,
it realy isn't that long.

Stability trumps speed of merge.

I don't want to have to write a script to auto-NAK anything that doesn't match
this requirement, but I'm starting to feel like I have to... Let's try to avoid
that.

> > >
> > I sincerely apologize for this.

You don't need to apologise Shivank, you just might have to wait a cycle.

> >
> > I had this doubt on sending an incremental patch or V5:
> > https://lore.kernel.org/linux-mm/7a42515f-ae57-4f4d-831c-87689930a797@amd.com
> >
> >
> > > > V5:
> > > > - In patch 2/2, Simplify dirty writeback retry logic (David)
> > > Are you sure this is the only change?  It looks like a lot for a
> > > simplification and I'm wondering if we should retain the v4 series and
> > > defer a simplification for separate consideration during the next
> > > cycle.
> >
> > Yes, patch 1/2 is unchanged and patch 2/2 is the only change.
> > The diff looks larger due to code movement but logic is actually simpler now.
> >
> > I completely understand if you prefer to keep V4 and defer this
> > refactoring. I'm sorry for creating this late-cycle churn. Please let me
> > know what you'd prefer and I'll follow your guidance.
>
> There is no reason to rush any of this. :)

Yes absolutely.

It's patently INSANE and in fact a ongoing security risk to rush things in mm in
general, and we really need to stop this.

>
> Review was not over and due to multiple factors my review on v4 came in
> later than I wanted.

Also you're on leave from work :)

I have taken a step back from review somewhat due to this insanity, but now feel
like I have to step it back up... due to this insanity.

Or maybe write a script...

>
> Andrew, if you prefer, I can start sending as a reply to every patch set
> that I want to review so you are aware that it is on my review backlog.

This is ridiculous and shouldn't be necessary.

>
> Unfortunately we can't have nice things (sub-maintainer acks).

Well we can start having less nice things like a NAK script if this unworkable
situation continues.

>
> --
> Cheers
>
> David

Again Andrew - as THP co-maintainer - I politely ask that, while you might not
want to implement a sub-M A-b/R-b requirement across mm (though you should),
that you implement one for THP explicitly.

If you don't then I don't really see any other option than to start NAK'ing
everything in THP that hits mm-stable which either David or I have not
explicitly tagged.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
  2026-01-18 19:09 ` [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE Shivank Garg
  2026-01-19 10:08   ` David Hildenbrand (Red Hat)
@ 2026-01-22 11:07   ` Dev Jain
  2026-01-22 22:34     ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 10+ messages in thread
From: Dev Jain @ 2026-01-22 11:07 UTC (permalink / raw)
  To: Shivank Garg, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Lance Yang, Masami Hiramatsu, Steven Rostedt,
	linux-trace-kernel, Mathieu Desnoyers, Zach O'Keefe,
	linux-mm, linux-kernel, Stephen Rothwell, Branden Moore


On 19/01/26 12:39 am, Shivank Garg wrote:
> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> text sections), the pages may still be dirty from recent writes.
> collapse_file() will trigger async writeback and fail with
> SCAN_PAGE_DIRTY_OR_WRITEBACK (-EAGAIN).
>
> MADV_COLLAPSE is a synchronous operation where userspace expects
> immediate results. If the collapse fails due to dirty pages, perform
> synchronous writeback on the specific range and retry once.
>
> This avoids spurious failures for freshly written executables while
> avoiding unnecessary synchronous I/O for mappings that are already clean.
>
> Reported-by: Branden Moore <Branden.Moore@amd.com>
> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Suggested-by: David Hildenbrand <david@kernel.org>
> Tested-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  mm/khugepaged.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 219dfa2e523c..16582bdcb6ff 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -22,6 +22,7 @@
>  #include <linux/dax.h>
>  #include <linux/ksm.h>
>  #include <linux/pgalloc.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -2788,7 +2789,9 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  
>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>  		int result = SCAN_FAIL;
> +		bool triggered_wb = false;
>  
> +retry:
>  		if (!mmap_locked) {
>  			cond_resched();
>  			mmap_read_lock(mm);
> @@ -2809,8 +2812,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  
>  			mmap_read_unlock(mm);
>  			mmap_locked = false;
> +			*lock_dropped = true;
>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
>  							  cc);
> +
> +			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> +			    mapping_can_writeback(file->f_mapping)) {
> +				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> +				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> +				filemap_write_and_wait_range(file->f_mapping, lstart, lend);

So we don't care about the return value here because this is best-effort.
I really wish we had in our coding-style.rst to typecast such things to (void),
so we know explicitly that we are ignoring the return value, and not that the
function itself returns void.

Nothing jumps out at me,

Reviewed-by: Dev Jain <dev.jain@arm.com>

> +				triggered_wb = true;
> +				fput(file);
> +				goto retry;
> +			}
>  			fput(file);
>  		} else {
>  			result = hpage_collapse_scan_pmd(mm, vma, addr,


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
  2026-01-22 11:07   ` Dev Jain
@ 2026-01-22 22:34     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-22 22:34 UTC (permalink / raw)
  To: Dev Jain, Shivank Garg, Andrew Morton, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Barry Song, Lance Yang, Masami Hiramatsu, Steven Rostedt,
	linux-trace-kernel, Mathieu Desnoyers, Zach O'Keefe,
	linux-mm, linux-kernel, Stephen Rothwell, Branden Moore

On 1/22/26 12:07, Dev Jain wrote:
> 
> On 19/01/26 12:39 am, Shivank Garg wrote:
>> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
>> text sections), the pages may still be dirty from recent writes.
>> collapse_file() will trigger async writeback and fail with
>> SCAN_PAGE_DIRTY_OR_WRITEBACK (-EAGAIN).
>>
>> MADV_COLLAPSE is a synchronous operation where userspace expects
>> immediate results. If the collapse fails due to dirty pages, perform
>> synchronous writeback on the specific range and retry once.
>>
>> This avoids spurious failures for freshly written executables while
>> avoiding unnecessary synchronous I/O for mappings that are already clean.
>>
>> Reported-by: Branden Moore <Branden.Moore@amd.com>
>> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
>> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Tested-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
>> ---
>>   mm/khugepaged.c | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 219dfa2e523c..16582bdcb6ff 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -22,6 +22,7 @@
>>   #include <linux/dax.h>
>>   #include <linux/ksm.h>
>>   #include <linux/pgalloc.h>
>> +#include <linux/backing-dev.h>
>>   
>>   #include <asm/tlb.h>
>>   #include "internal.h"
>> @@ -2788,7 +2789,9 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>   
>>   	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>>   		int result = SCAN_FAIL;
>> +		bool triggered_wb = false;
>>   
>> +retry:
>>   		if (!mmap_locked) {
>>   			cond_resched();
>>   			mmap_read_lock(mm);
>> @@ -2809,8 +2812,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>   
>>   			mmap_read_unlock(mm);
>>   			mmap_locked = false;
>> +			*lock_dropped = true;
>>   			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
>>   							  cc);
>> +
>> +			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>> +			    mapping_can_writeback(file->f_mapping)) {
>> +				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
>> +				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>> +
>> +				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> 
> So we don't care about the return value here because this is best-effort.
> I really wish we had in our coding-style.rst to typecast such things to (void),
> so we know explicitly that we are ignoring the return value, and not that the
> function itself returns void.

That makes functions like bitmap_and() hard (and ugly) to use that just 
return some value for the caller's convenience.

For functions where we really want callers to think about this, we can 
enforce such checks through __must_check.

Here, it's rather obvious that we don't care about the result as we only 
retry once to then give up.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-01-22 22:35 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-18 19:09 [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
2026-01-18 19:09 ` [PATCH V5 1/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
2026-01-18 19:09 ` [PATCH V5 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE Shivank Garg
2026-01-19 10:08   ` David Hildenbrand (Red Hat)
2026-01-22 11:07   ` Dev Jain
2026-01-22 22:34     ` David Hildenbrand (Red Hat)
2026-01-18 20:22 ` [PATCH V5 0/2] mm/khugepaged: fix dirty page handling " Andrew Morton
2026-01-19  5:07   ` Garg, Shivank
2026-01-19  9:58     ` David Hildenbrand (Red Hat)
2026-01-19 10:54       ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox