* [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
@ 2025-11-20 6:50 Shivank Garg
2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
0 siblings, 2 replies; 13+ messages in thread
From: Shivank Garg @ 2025-11-20 6:50 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel, Shivank Garg
MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages
are dirty. This may affect real scenarios: package/container updates,
executing binaries immediately after writing them, etc.
The issue is that collapse_file() triggers async writeback and returns
SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But
MADV_COLLAPSE is synchronous and userspace expects immediate success or a
clear retry signal.
Reproduction:
- Copy 2MB-aligned executable to freshly mounted XFS/ext4
- Call MADV_COLLAPSE on .text section
- First call fails with -EINVAL (text pages dirty from copy)
- Second call succeeds (async writeback completed)
Issue Report:
https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
v2:
- Move writeback to madvise_collapse() (better abstraction, proper
mmap_lock handling and does VMA revalidation after I/O) (Lorenzo)
- Rename to SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use
for all dirty/writeback folio cases that previously returned incorrect
results (Dev)
v1: https://lore.kernel.org/all/20251110113254.77822-1-shivankg@amd.com
Shivank Garg (2):
mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
mm/khugepaged: map dirty/writeback pages failures to EAGAIN
include/trace/events/huge_memory.h | 3 ++-
mm/khugepaged.c | 34 +++++++++++++++++++++++++++---
2 files changed, 33 insertions(+), 4 deletions(-)
base-commit: 4a3f8fc3adb7046e44bd1feb2f5c5fe95296894f
--
2.43.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
2025-11-20 6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
@ 2025-11-20 6:50 ` Shivank Garg
2025-11-20 13:01 ` Lance Yang
2025-11-20 13:35 ` David Hildenbrand (Red Hat)
2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
1 sibling, 2 replies; 13+ messages in thread
From: Shivank Garg @ 2025-11-20 6:50 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel, Shivank Garg, Branden Moore
When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
text sections), the pages may still be dirty from recent writes and
cause collapse to fail with -EINVAL. This is particularly problematic
for freshly copied executables on filesystems, where page cache folios
remain dirty until background writeback completes.
The current code in collapse_file() triggers async writeback via
filemap_flush() and expects khugepaged to revisit the page later.
However, MADV_COLLAPSE is a synchronous operation where userspace
expects immediate results.
Perform synchronous writeback in madvise_collapse() before attempting
collapse to avoid failing on first attempt.
Reported-by: Branden Moore <Branden.Moore@amd.com>
Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
mm/khugepaged.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..066a332c76ad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -22,6 +22,7 @@
#include <linux/dax.h>
#include <linux/ksm.h>
#include <linux/pgalloc.h>
+#include <linux/backing-dev.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
hend = end & HPAGE_PMD_MASK;
+ /*
+ * For file-backed VMAs, perform synchronous writeback to ensure
+ * dirty folios are flushed before attempting collapse. This avoids
+ * failing on the first attempt when freshly-written executable text
+ * is still dirty in the page cache.
+ */
+ if (!vma_is_anonymous(vma) && vma->vm_file) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+
+ if (mapping_can_writeback(mapping)) {
+ pgoff_t pgoff_start = linear_page_index(vma, hstart);
+ pgoff_t pgoff_end = linear_page_index(vma, hend);
+ loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
+ loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
+
+ mmap_read_unlock(mm);
+ mmap_locked = false;
+
+ if (filemap_write_and_wait_range(mapping, lstart, lend)) {
+ last_fail = SCAN_FAIL;
+ goto out_maybelock;
+ }
+ }
+ }
+
for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
int result = SCAN_FAIL;
--
2.43.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
@ 2025-11-20 6:50 ` Shivank Garg
2025-11-20 8:03 ` Dev Jain
1 sibling, 1 reply; 13+ messages in thread
From: Shivank Garg @ 2025-11-20 6:50 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel, Shivank Garg
When collapse_file encounters dirty or writeback pages in file-backed
mappings, it currently SCAN_FAIL which maps to -EINVAL. This is
misleading as EINVAL suggests invalid arguments, whereas dirty/writeback
pages represent transient conditions that may resolve on retry.
Introduce SCAN_PAGE_NOT_CLEAN to cover both dirty and writeback states,
mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with
a clear signal that retry may succeed after writeback completes, making
-EAGAIN semantically correct. For khugepaged, this is harmless as it
will naturally revisit the range during periodic scans after async
writeback completes.
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
include/trace/events/huge_memory.h | 3 ++-
mm/khugepaged.c | 8 +++++---
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4cde53b45a85..1caf24b951e1 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -37,7 +37,8 @@
EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
EM( SCAN_STORE_FAILED, "store_failed") \
EM( SCAN_COPY_MC, "copy_poisoned_page") \
- EMe(SCAN_PAGE_FILLED, "page_filled")
+ EM( SCAN_PAGE_FILLED, "page_filled") \
+ EMe(SCAN_PAGE_NOT_CLEAN, "page_not_clean")
#undef EM
#undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 066a332c76ad..282b413d17e8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -59,6 +59,7 @@ enum scan_result {
SCAN_STORE_FAILED,
SCAN_COPY_MC,
SCAN_PAGE_FILLED,
+ SCAN_PAGE_NOT_CLEAN,
};
#define CREATE_TRACE_POINTS
@@ -1968,11 +1969,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
*/
xas_unlock_irq(&xas);
filemap_flush(mapping);
- result = SCAN_FAIL;
+ result = SCAN_PAGE_NOT_CLEAN;
goto xa_unlocked;
} else if (folio_test_writeback(folio)) {
xas_unlock_irq(&xas);
- result = SCAN_FAIL;
+ result = SCAN_PAGE_NOT_CLEAN;
goto xa_unlocked;
} else if (folio_trylock(folio)) {
folio_get(folio);
@@ -2019,7 +2020,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
* folio is dirty because it hasn't been flushed
* since first write.
*/
- result = SCAN_FAIL;
+ result = SCAN_PAGE_NOT_CLEAN;
goto out_unlock;
}
@@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r)
case SCAN_PAGE_LRU:
case SCAN_DEL_PAGE_LRU:
case SCAN_PAGE_FILLED:
+ case SCAN_PAGE_NOT_CLEAN:
return -EAGAIN;
/*
* Other: Trying again likely not to succeed / error intrinsic to
--
2.43.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
@ 2025-11-20 8:03 ` Dev Jain
2025-11-20 8:17 ` Garg, Shivank
0 siblings, 1 reply; 13+ messages in thread
From: Dev Jain @ 2025-11-20 8:03 UTC (permalink / raw)
To: Shivank Garg, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
linux-trace-kernel
On 20/11/25 12:20 pm, Shivank Garg wrote:
> When collapse_file encounters dirty or writeback pages in file-backed
> mappings, it currently SCAN_FAIL which maps to -EINVAL. This is
> misleading as EINVAL suggests invalid arguments, whereas dirty/writeback
> pages represent transient conditions that may resolve on retry.
>
> Introduce SCAN_PAGE_NOT_CLEAN to cover both dirty and writeback states,
> mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with
> a clear signal that retry may succeed after writeback completes, making
> -EAGAIN semantically correct. For khugepaged, this is harmless as it
> will naturally revisit the range during periodic scans after async
> writeback completes.
>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
> include/trace/events/huge_memory.h | 3 ++-
> mm/khugepaged.c | 8 +++++---
> 2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4cde53b45a85..1caf24b951e1 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -37,7 +37,8 @@
> EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
> EM( SCAN_STORE_FAILED, "store_failed") \
> EM( SCAN_COPY_MC, "copy_poisoned_page") \
> - EMe(SCAN_PAGE_FILLED, "page_filled")
> + EM( SCAN_PAGE_FILLED, "page_filled") \
> + EMe(SCAN_PAGE_NOT_CLEAN, "page_not_clean")
>
> #undef EM
> #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 066a332c76ad..282b413d17e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -59,6 +59,7 @@ enum scan_result {
> SCAN_STORE_FAILED,
> SCAN_COPY_MC,
> SCAN_PAGE_FILLED,
> + SCAN_PAGE_NOT_CLEAN,
> };
>
> #define CREATE_TRACE_POINTS
> @@ -1968,11 +1969,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> */
> xas_unlock_irq(&xas);
> filemap_flush(mapping);
> - result = SCAN_FAIL;
> + result = SCAN_PAGE_NOT_CLEAN;
> goto xa_unlocked;
> } else if (folio_test_writeback(folio)) {
> xas_unlock_irq(&xas);
> - result = SCAN_FAIL;
> + result = SCAN_PAGE_NOT_CLEAN;
> goto xa_unlocked;
> } else if (folio_trylock(folio)) {
> folio_get(folio);
> @@ -2019,7 +2020,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> * folio is dirty because it hasn't been flushed
> * since first write.
> */
> - result = SCAN_FAIL;
> + result = SCAN_PAGE_NOT_CLEAN;
> goto out_unlock;
> }
>
> @@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r)
> case SCAN_PAGE_LRU:
> case SCAN_DEL_PAGE_LRU:
> case SCAN_PAGE_FILLED:
> + case SCAN_PAGE_NOT_CLEAN:
> return -EAGAIN;
> /*
> * Other: Trying again likely not to succeed / error intrinsic to
SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
Reviewed-by: Dev Jain <dev.jain@arm.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 8:03 ` Dev Jain
@ 2025-11-20 8:17 ` Garg, Shivank
2025-11-20 9:55 ` Dev Jain
2025-11-20 12:24 ` Lance Yang
0 siblings, 2 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-20 8:17 UTC (permalink / raw)
To: Dev Jain, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
linux-trace-kernel
On 11/20/2025 1:33 PM, Dev Jain wrote:
>
> On 20/11/25 12:20 pm, Shivank Garg wrote:
> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
>
Thanks for the review.
I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
Dirty: Memory that is waiting to be written back to disk
Writeback: Memory that is actively being written back to disk
[1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
for pages in the writeback state.
I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable
state suitable for collapse.
[1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
Thanks,
Shivank
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 8:17 ` Garg, Shivank
@ 2025-11-20 9:55 ` Dev Jain
2025-11-20 12:24 ` Lance Yang
1 sibling, 0 replies; 13+ messages in thread
From: Dev Jain @ 2025-11-20 9:55 UTC (permalink / raw)
To: Garg, Shivank, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
linux-trace-kernel
On 20/11/25 1:47 pm, Garg, Shivank wrote:
>
> On 11/20/2025 1:33 PM, Dev Jain wrote:
>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>
> Thanks for the review.
>
> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>
> Dirty: Memory that is waiting to be written back to disk
> Writeback: Memory that is actively being written back to disk
>
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>
> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
> for pages in the writeback state.
>
> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
>
> SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable
> state suitable for collapse.
>
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
Alright, makes sense.
>
> Thanks,
> Shivank
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 8:17 ` Garg, Shivank
2025-11-20 9:55 ` Dev Jain
@ 2025-11-20 12:24 ` Lance Yang
2025-11-20 13:29 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 13+ messages in thread
From: Lance Yang @ 2025-11-20 12:24 UTC (permalink / raw)
To: Garg, Shivank, Dev Jain
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, David Hildenbrand,
Nico Pache, Ryan Roberts, Barry Song, Andrew Morton,
Steven Rostedt, Lorenzo Stoakes, Masami Hiramatsu,
Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel,
linux-trace-kernel
On 2025/11/20 16:17, Garg, Shivank wrote:
>
>
> On 11/20/2025 1:33 PM, Dev Jain wrote:
>>
>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>
>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>
> Thanks for the review.
>
> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>
> Dirty: Memory that is waiting to be written back to disk
> Writeback: Memory that is actively being written back to disk
>
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>
> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
> for pages in the writeback state.
>
> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK is too verbose, how about
SCAN_PAGE_DIRTY_WB?
It keeps the specificity without the length, and is arguably more
descriptive
than NOT_CLEAN ;)
That said, LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
>
> SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable
> state suitable for collapse.
>
> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>
> Thanks,
> Shivank
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
@ 2025-11-20 13:01 ` Lance Yang
2025-11-21 6:27 ` Garg, Shivank
2025-11-20 13:35 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 13+ messages in thread
From: Lance Yang @ 2025-11-20 13:01 UTC (permalink / raw)
To: Shivank Garg
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Steven Rostedt, David Hildenbrand,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
Andrew Morton, linux-kernel, linux-trace-kernel, Branden Moore,
Lorenzo Stoakes
On 2025/11/20 14:50, Shivank Garg wrote:
> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> text sections), the pages may still be dirty from recent writes and
> cause collapse to fail with -EINVAL. This is particularly problematic
> for freshly copied executables on filesystems, where page cache folios
> remain dirty until background writeback completes.
>
> The current code in collapse_file() triggers async writeback via
> filemap_flush() and expects khugepaged to revisit the page later.
> However, MADV_COLLAPSE is a synchronous operation where userspace
> expects immediate results.
>
> Perform synchronous writeback in madvise_collapse() before attempting
> collapse to avoid failing on first attempt.
Thanks!
>
> Reported-by: Branden Moore <Branden.Moore@amd.com>
> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
> mm/khugepaged.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..066a332c76ad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -22,6 +22,7 @@
> #include <linux/dax.h>
> #include <linux/ksm.h>
> #include <linux/pgalloc.h>
> +#include <linux/backing-dev.h>
>
> #include <asm/tlb.h>
> #include "internal.h"
> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> hend = end & HPAGE_PMD_MASK;
>
> + /*
> + * For file-backed VMAs, perform synchronous writeback to ensure
> + * dirty folios are flushed before attempting collapse. This avoids
> + * failing on the first attempt when freshly-written executable text
> + * is still dirty in the page cache.
> + */
> + if (!vma_is_anonymous(vma) && vma->vm_file) {
> + struct address_space *mapping = vma->vm_file->f_mapping;
> +
> + if (mapping_can_writeback(mapping)) {
> + pgoff_t pgoff_start = linear_page_index(vma, hstart);
> + pgoff_t pgoff_end = linear_page_index(vma, hend);
> + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
> + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
It looks like we need to hold a reference to the file here before
dropping the mmap lock :)
file = get_file(vma->vm_file);
Without it, the vma could be destroyed by a concurrent munmap() while
we are waiting in filemap_write_and_wait_range(), leading to a UAF
on mapping, IIUC ...
> +
> + mmap_read_unlock(mm);
> + mmap_locked = false;
> +
> + if (filemap_write_and_wait_range(mapping, lstart, lend)) {
And drop the reference :)
fput(file);
> + last_fail = SCAN_FAIL;
> + goto out_maybelock;
> + }
Same here :)
fput(file);
> + }
> + }
> +
> for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> int result = SCAN_FAIL;
>
Cheers,
Lance
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 12:24 ` Lance Yang
@ 2025-11-20 13:29 ` David Hildenbrand (Red Hat)
2025-11-21 6:15 ` Garg, Shivank
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-20 13:29 UTC (permalink / raw)
To: Lance Yang, Garg, Shivank, Dev Jain
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel
On 11/20/25 13:24, Lance Yang wrote:
>
>
> On 2025/11/20 16:17, Garg, Shivank wrote:
>>
>>
>> On 11/20/2025 1:33 PM, Dev Jain wrote:
>>>
>>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>>
>>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>>
>>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>>
>> Thanks for the review.
>>
>> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>>
>> Dirty: Memory that is waiting to be written back to disk
>> Writeback: Memory that is actively being written back to disk
>>
>> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>>
>> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
>> for pages in the writeback state.
>>
>> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
>
> Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK
I would prefer that here.
--
Cheers
David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
2025-11-20 13:01 ` Lance Yang
@ 2025-11-20 13:35 ` David Hildenbrand (Red Hat)
2025-11-21 6:27 ` Garg, Shivank
1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-20 13:35 UTC (permalink / raw)
To: Shivank Garg, Andrew Morton, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel, Branden Moore
On 11/20/25 07:50, Shivank Garg wrote:
> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> text sections), the pages may still be dirty from recent writes and
> cause collapse to fail with -EINVAL. This is particularly problematic
> for freshly copied executables on filesystems, where page cache folios
> remain dirty until background writeback completes.
>
> The current code in collapse_file() triggers async writeback via
> filemap_flush() and expects khugepaged to revisit the page later.
> However, MADV_COLLAPSE is a synchronous operation where userspace
> expects immediate results.
>
> Perform synchronous writeback in madvise_collapse() before attempting
> collapse to avoid failing on first attempt.
>
> Reported-by: Branden Moore <Branden.Moore@amd.com>
> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
> mm/khugepaged.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..066a332c76ad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -22,6 +22,7 @@
> #include <linux/dax.h>
> #include <linux/ksm.h>
> #include <linux/pgalloc.h>
> +#include <linux/backing-dev.h>
>
> #include <asm/tlb.h>
> #include "internal.h"
> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> hend = end & HPAGE_PMD_MASK;
>
> + /*
> + * For file-backed VMAs, perform synchronous writeback to ensure
> + * dirty folios are flushed before attempting collapse. This avoids
> + * failing on the first attempt when freshly-written executable text
> + * is still dirty in the page cache.
> + */
> + if (!vma_is_anonymous(vma) && vma->vm_file) {
> + struct address_space *mapping = vma->vm_file->f_mapping;
> +
> + if (mapping_can_writeback(mapping)) {
> + pgoff_t pgoff_start = linear_page_index(vma, hstart);
> + pgoff_t pgoff_end = linear_page_index(vma, hend);
> + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
> + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
> +
Hm, so we always do that, without any indication that there actually is
something dirty there.
Internally filemap_write_and_wait_range() uses something called
mapping_needs_writeback(), but it also applies to the complete file, not
a range.
Wouldn't it be better do do that only if we detect that there is
actually a dirty folio in the range?
That is, if we find any dirty folio in hpage_collapse_scan_file() and we
are in madvise, do that dance here and retry?
--
Cheers
David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN
2025-11-20 13:29 ` David Hildenbrand (Red Hat)
@ 2025-11-21 6:15 ` Garg, Shivank
0 siblings, 0 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-21 6:15 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), Lance Yang, Dev Jain
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel
On 11/20/2025 6:59 PM, David Hildenbrand (Red Hat) wrote:
> On 11/20/25 13:24, Lance Yang wrote:
>>
>>
>> On 2025/11/20 16:17, Garg, Shivank wrote:
>>>
>>>
>>> On 11/20/2025 1:33 PM, Dev Jain wrote:
>>>>
>>>> On 20/11/25 12:20 pm, Shivank Garg wrote:
>>>
>>>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY?
>>>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of
>>>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do.
>>>>
>>>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>>>>
>>> Thanks for the review.
>>>
>>> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]:
>>>
>>> Dirty: Memory that is waiting to be written back to disk
>>> Writeback: Memory that is actively being written back to disk
>>>
>>> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>>>
>>> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading
>>> for pages in the writeback state.
>>>
>>> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long.
>>
>> Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK
>
> I would prefer that here.
>
I agree on this.
If the consensus is SCAN_PAGE_DIRTY_OR_WRITEBACK, I'll use it in v3.
Thanks,
Shivank
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
2025-11-20 13:01 ` Lance Yang
@ 2025-11-21 6:27 ` Garg, Shivank
0 siblings, 0 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-21 6:27 UTC (permalink / raw)
To: Lance Yang
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Steven Rostedt, David Hildenbrand,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
Andrew Morton, linux-kernel, linux-trace-kernel, Branden Moore,
Lorenzo Stoakes
On 11/20/2025 6:31 PM, Lance Yang wrote:
>
>
> On 2025/11/20 14:50, Shivank Garg wrote:
>> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
>> text sections), the pages may still be dirty from recent writes and
>> cause collapse to fail with -EINVAL. This is particularly problematic
>> for freshly copied executables on filesystems, where page cache folios
>> remain dirty until background writeback completes.
>>
>> The current code in collapse_file() triggers async writeback via
>> filemap_flush() and expects khugepaged to revisit the page later.
>> However, MADV_COLLAPSE is a synchronous operation where userspace
>> expects immediate results.
>>
>> Perform synchronous writeback in madvise_collapse() before attempting
>> collapse to avoid failing on first attempt.
>
> Thanks!
>
>>
>> Reported-by: Branden Moore <Branden.Moore@amd.com>
>> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
>> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
>> ---
>> mm/khugepaged.c | 26 ++++++++++++++++++++++++++
>> 1 file changed, 26 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..066a332c76ad 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -22,6 +22,7 @@
>> #include <linux/dax.h>
>> #include <linux/ksm.h>
>> #include <linux/pgalloc.h>
>> +#include <linux/backing-dev.h>
>> #include <asm/tlb.h>
>> #include "internal.h"
>> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>> hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>> hend = end & HPAGE_PMD_MASK;
>> + /*
>> + * For file-backed VMAs, perform synchronous writeback to ensure
>> + * dirty folios are flushed before attempting collapse. This avoids
>> + * failing on the first attempt when freshly-written executable text
>> + * is still dirty in the page cache.
>> + */
>> + if (!vma_is_anonymous(vma) && vma->vm_file) {
>> + struct address_space *mapping = vma->vm_file->f_mapping;
>> +
>> + if (mapping_can_writeback(mapping)) {
>> + pgoff_t pgoff_start = linear_page_index(vma, hstart);
>> + pgoff_t pgoff_end = linear_page_index(vma, hend);
>> + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
>> + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
>
> It looks like we need to hold a reference to the file here before
> dropping the mmap lock :)
>
> file = get_file(vma->vm_file);
>
> Without it, the vma could be destroyed by a concurrent munmap() while
> we are waiting in filemap_write_and_wait_range(), leading to a UAF
> on mapping, IIUC ...
Excellent catch!
Thanks for saving me from this nasty bug. I'll be more careful on file ref
handling in next version.
Best Regards,
Shivank
>
>> +
>> + mmap_read_unlock(mm);
>> + mmap_locked = false;
>> +
>> + if (filemap_write_and_wait_range(mapping, lstart, lend)) {
>
> And drop the reference :)
>
> fput(file);
>
>
>> + last_fail = SCAN_FAIL;
>> + goto out_maybelock;
>> + }
>
> Same here :)
>
> fput(file);
>
>
>> + }
>> + }
>> +
>> for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>> int result = SCAN_FAIL;
>>
>
> Cheers,
> Lance
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
2025-11-20 13:35 ` David Hildenbrand (Red Hat)
@ 2025-11-21 6:27 ` Garg, Shivank
0 siblings, 0 replies; 13+ messages in thread
From: Garg, Shivank @ 2025-11-21 6:27 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), Andrew Morton, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm,
linux-kernel, linux-trace-kernel, Branden Moore
On 11/20/2025 7:05 PM, David Hildenbrand (Red Hat) wrote:
> On 11/20/25 07:50, Shivank Garg wrote:
>> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
>> text sections), the pages may still be dirty from recent writes and
>> cause collapse to fail with -EINVAL. This is particularly problematic
>> for freshly copied executables on filesystems, where page cache folios
>> remain dirty until background writeback completes.
>>
>> The current code in collapse_file() triggers async writeback via
>> filemap_flush() and expects khugepaged to revisit the page later.
>> However, MADV_COLLAPSE is a synchronous operation where userspace
>> expects immediate results.
>>
>> Perform synchronous writeback in madvise_collapse() before attempting
>> collapse to avoid failing on first attempt.
>>
>> Reported-by: Branden Moore <Branden.Moore@amd.com>
>> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
>> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
>> ---
>> mm/khugepaged.c | 26 ++++++++++++++++++++++++++
>> 1 file changed, 26 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..066a332c76ad 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -22,6 +22,7 @@
>> #include <linux/dax.h>
>> #include <linux/ksm.h>
>> #include <linux/pgalloc.h>
>> +#include <linux/backing-dev.h>
>> #include <asm/tlb.h>
>> #include "internal.h"
>> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>> hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>> hend = end & HPAGE_PMD_MASK;
>> + /*
>> + * For file-backed VMAs, perform synchronous writeback to ensure
>> + * dirty folios are flushed before attempting collapse. This avoids
>> + * failing on the first attempt when freshly-written executable text
>> + * is still dirty in the page cache.
>> + */
>> + if (!vma_is_anonymous(vma) && vma->vm_file) {
>> + struct address_space *mapping = vma->vm_file->f_mapping;
>> +
>> + if (mapping_can_writeback(mapping)) {
>> + pgoff_t pgoff_start = linear_page_index(vma, hstart);
>> + pgoff_t pgoff_end = linear_page_index(vma, hend);
>> + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT;
>> + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1;
>> +
>
> Hm, so we always do that, without any indication that there actually is something dirty there.
>
> Internally filemap_write_and_wait_range() uses something called mapping_needs_writeback(), but it also applies to the complete file, not a range.
>
> Wouldn't it be better do do that only if we detect that there is actually a dirty folio in the range?
>
> That is, if we find any dirty folio in hpage_collapse_scan_file() and we are in madvise, do that dance here and retry?
>
Good point! This makes sense to me.
I'll send V3 with this approach.
Thanks,
Shivank
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-11-21 6:28 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-20 6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg
2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg
2025-11-20 13:01 ` Lance Yang
2025-11-21 6:27 ` Garg, Shivank
2025-11-20 13:35 ` David Hildenbrand (Red Hat)
2025-11-21 6:27 ` Garg, Shivank
2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg
2025-11-20 8:03 ` Dev Jain
2025-11-20 8:17 ` Garg, Shivank
2025-11-20 9:55 ` Dev Jain
2025-11-20 12:24 ` Lance Yang
2025-11-20 13:29 ` David Hildenbrand (Red Hat)
2025-11-21 6:15 ` Garg, Shivank
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox