* [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE @ 2025-11-20 6:50 Shivank Garg 2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg 2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg 0 siblings, 2 replies; 13+ messages in thread From: Shivank Garg @ 2025-11-20 6:50 UTC (permalink / raw) To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel, Shivank Garg MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages are dirty. This may affect real scenarios: package/container updates, executing binaries immediately after writing them, etc. The issue is that collapse_file() triggers async writeback and returns SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But MADV_COLLAPSE is synchronous and userspace expects immediate success or a clear retry signal. Reproduction: - Copy 2MB-aligned executable to freshly mounted XFS/ext4 - Call MADV_COLLAPSE on .text section - First call fails with -EINVAL (text pages dirty from copy) - Second call succeeds (async writeback completed) Issue Report: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com v2: - Move writeback to madvise_collapse() (better abstraction, proper mmap_lock handling and does VMA revalidation after I/O) (Lorenzo) - Rename to SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use for all dirty/writeback folio cases that previously returned incorrect results (Dev) v1: https://lore.kernel.org/all/20251110113254.77822-1-shivankg@amd.com Shivank Garg (2): mm/khugepaged: do synchronous writeback for MADV_COLLAPSE mm/khugepaged: map dirty/writeback pages failures to EAGAIN include/trace/events/huge_memory.h | 3 ++- mm/khugepaged.c | 34 +++++++++++++++++++++++++++--- 2 files changed, 33 insertions(+), 4 deletions(-) base-commit: 4a3f8fc3adb7046e44bd1feb2f5c5fe95296894f -- 2.43.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE 2025-11-20 6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg @ 2025-11-20 6:50 ` Shivank Garg 2025-11-20 13:01 ` Lance Yang 2025-11-20 13:35 ` David Hildenbrand (Red Hat) 2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg 1 sibling, 2 replies; 13+ messages in thread From: Shivank Garg @ 2025-11-20 6:50 UTC (permalink / raw) To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel, Shivank Garg, Branden Moore When MADV_COLLAPSE is called on file-backed mappings (e.g., executable text sections), the pages may still be dirty from recent writes and cause collapse to fail with -EINVAL. This is particularly problematic for freshly copied executables on filesystems, where page cache folios remain dirty until background writeback completes. The current code in collapse_file() triggers async writeback via filemap_flush() and expects khugepaged to revisit the page later. However, MADV_COLLAPSE is a synchronous operation where userspace expects immediate results. Perform synchronous writeback in madvise_collapse() before attempting collapse to avoid failing on first attempt. Reported-by: Branden Moore <Branden.Moore@amd.com> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Suggested-by: David Hildenbrand <david@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> --- mm/khugepaged.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 97d1b2824386..066a332c76ad 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -22,6 +22,7 @@ #include <linux/dax.h> #include <linux/ksm.h> #include <linux/pgalloc.h> +#include <linux/backing-dev.h> #include <asm/tlb.h> #include "internal.h" @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; hend = end & HPAGE_PMD_MASK; + /* + * For file-backed VMAs, perform synchronous writeback to ensure + * dirty folios are flushed before attempting collapse. This avoids + * failing on the first attempt when freshly-written executable text + * is still dirty in the page cache. + */ + if (!vma_is_anonymous(vma) && vma->vm_file) { + struct address_space *mapping = vma->vm_file->f_mapping; + + if (mapping_can_writeback(mapping)) { + pgoff_t pgoff_start = linear_page_index(vma, hstart); + pgoff_t pgoff_end = linear_page_index(vma, hend); + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT; + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1; + + mmap_read_unlock(mm); + mmap_locked = false; + + if (filemap_write_and_wait_range(mapping, lstart, lend)) { + last_fail = SCAN_FAIL; + goto out_maybelock; + } + } + } + for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) { int result = SCAN_FAIL; -- 2.43.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE 2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg @ 2025-11-20 13:01 ` Lance Yang 2025-11-21 6:27 ` Garg, Shivank 2025-11-20 13:35 ` David Hildenbrand (Red Hat) 1 sibling, 1 reply; 13+ messages in thread From: Lance Yang @ 2025-11-20 13:01 UTC (permalink / raw) To: Shivank Garg Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Steven Rostedt, David Hildenbrand, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, Andrew Morton, linux-kernel, linux-trace-kernel, Branden Moore, Lorenzo Stoakes On 2025/11/20 14:50, Shivank Garg wrote: > When MADV_COLLAPSE is called on file-backed mappings (e.g., executable > text sections), the pages may still be dirty from recent writes and > cause collapse to fail with -EINVAL. This is particularly problematic > for freshly copied executables on filesystems, where page cache folios > remain dirty until background writeback completes. > > The current code in collapse_file() triggers async writeback via > filemap_flush() and expects khugepaged to revisit the page later. > However, MADV_COLLAPSE is a synchronous operation where userspace > expects immediate results. > > Perform synchronous writeback in madvise_collapse() before attempting > collapse to avoid failing on first attempt. Thanks! > > Reported-by: Branden Moore <Branden.Moore@amd.com> > Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com > Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") > Suggested-by: David Hildenbrand <david@kernel.org> > Signed-off-by: Shivank Garg <shivankg@amd.com> > --- > mm/khugepaged.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 97d1b2824386..066a332c76ad 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -22,6 +22,7 @@ > #include <linux/dax.h> > #include <linux/ksm.h> > #include <linux/pgalloc.h> > +#include <linux/backing-dev.h> > > #include <asm/tlb.h> > #include "internal.h" > @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, > hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; > hend = end & HPAGE_PMD_MASK; > > + /* > + * For file-backed VMAs, perform synchronous writeback to ensure > + * dirty folios are flushed before attempting collapse. This avoids > + * failing on the first attempt when freshly-written executable text > + * is still dirty in the page cache. > + */ > + if (!vma_is_anonymous(vma) && vma->vm_file) { > + struct address_space *mapping = vma->vm_file->f_mapping; > + > + if (mapping_can_writeback(mapping)) { > + pgoff_t pgoff_start = linear_page_index(vma, hstart); > + pgoff_t pgoff_end = linear_page_index(vma, hend); > + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT; > + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1; It looks like we need to hold a reference to the file here before dropping the mmap lock :) file = get_file(vma->vm_file); Without it, the vma could be destroyed by a concurrent munmap() while we are waiting in filemap_write_and_wait_range(), leading to a UAF on mapping, IIUC ... > + > + mmap_read_unlock(mm); > + mmap_locked = false; > + > + if (filemap_write_and_wait_range(mapping, lstart, lend)) { And drop the reference :) fput(file); > + last_fail = SCAN_FAIL; > + goto out_maybelock; > + } Same here :) fput(file); > + } > + } > + > for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) { > int result = SCAN_FAIL; > Cheers, Lance ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE 2025-11-20 13:01 ` Lance Yang @ 2025-11-21 6:27 ` Garg, Shivank 0 siblings, 0 replies; 13+ messages in thread From: Garg, Shivank @ 2025-11-21 6:27 UTC (permalink / raw) To: Lance Yang Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Steven Rostedt, David Hildenbrand, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, Andrew Morton, linux-kernel, linux-trace-kernel, Branden Moore, Lorenzo Stoakes On 11/20/2025 6:31 PM, Lance Yang wrote: > > > On 2025/11/20 14:50, Shivank Garg wrote: >> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable >> text sections), the pages may still be dirty from recent writes and >> cause collapse to fail with -EINVAL. This is particularly problematic >> for freshly copied executables on filesystems, where page cache folios >> remain dirty until background writeback completes. >> >> The current code in collapse_file() triggers async writeback via >> filemap_flush() and expects khugepaged to revisit the page later. >> However, MADV_COLLAPSE is a synchronous operation where userspace >> expects immediate results. >> >> Perform synchronous writeback in madvise_collapse() before attempting >> collapse to avoid failing on first attempt. > > Thanks! > >> >> Reported-by: Branden Moore <Branden.Moore@amd.com> >> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com >> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") >> Suggested-by: David Hildenbrand <david@kernel.org> >> Signed-off-by: Shivank Garg <shivankg@amd.com> >> --- >> mm/khugepaged.c | 26 ++++++++++++++++++++++++++ >> 1 file changed, 26 insertions(+) >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index 97d1b2824386..066a332c76ad 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -22,6 +22,7 @@ >> #include <linux/dax.h> >> #include <linux/ksm.h> >> #include <linux/pgalloc.h> >> +#include <linux/backing-dev.h> >> #include <asm/tlb.h> >> #include "internal.h" >> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, >> hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; >> hend = end & HPAGE_PMD_MASK; >> + /* >> + * For file-backed VMAs, perform synchronous writeback to ensure >> + * dirty folios are flushed before attempting collapse. This avoids >> + * failing on the first attempt when freshly-written executable text >> + * is still dirty in the page cache. >> + */ >> + if (!vma_is_anonymous(vma) && vma->vm_file) { >> + struct address_space *mapping = vma->vm_file->f_mapping; >> + >> + if (mapping_can_writeback(mapping)) { >> + pgoff_t pgoff_start = linear_page_index(vma, hstart); >> + pgoff_t pgoff_end = linear_page_index(vma, hend); >> + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT; >> + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1; > > It looks like we need to hold a reference to the file here before > dropping the mmap lock :) > > file = get_file(vma->vm_file); > > Without it, the vma could be destroyed by a concurrent munmap() while > we are waiting in filemap_write_and_wait_range(), leading to a UAF > on mapping, IIUC ... Excellent catch! Thanks for saving me from this nasty bug. I'll be more careful on file ref handling in next version. Best Regards, Shivank > >> + >> + mmap_read_unlock(mm); >> + mmap_locked = false; >> + >> + if (filemap_write_and_wait_range(mapping, lstart, lend)) { > > And drop the reference :) > > fput(file); > > >> + last_fail = SCAN_FAIL; >> + goto out_maybelock; >> + } > > Same here :) > > fput(file); > > >> + } >> + } >> + >> for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) { >> int result = SCAN_FAIL; >> > > Cheers, > Lance ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE 2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg 2025-11-20 13:01 ` Lance Yang @ 2025-11-20 13:35 ` David Hildenbrand (Red Hat) 2025-11-21 6:27 ` Garg, Shivank 1 sibling, 1 reply; 13+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-20 13:35 UTC (permalink / raw) To: Shivank Garg, Andrew Morton, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel, Branden Moore On 11/20/25 07:50, Shivank Garg wrote: > When MADV_COLLAPSE is called on file-backed mappings (e.g., executable > text sections), the pages may still be dirty from recent writes and > cause collapse to fail with -EINVAL. This is particularly problematic > for freshly copied executables on filesystems, where page cache folios > remain dirty until background writeback completes. > > The current code in collapse_file() triggers async writeback via > filemap_flush() and expects khugepaged to revisit the page later. > However, MADV_COLLAPSE is a synchronous operation where userspace > expects immediate results. > > Perform synchronous writeback in madvise_collapse() before attempting > collapse to avoid failing on first attempt. > > Reported-by: Branden Moore <Branden.Moore@amd.com> > Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com > Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") > Suggested-by: David Hildenbrand <david@kernel.org> > Signed-off-by: Shivank Garg <shivankg@amd.com> > --- > mm/khugepaged.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 97d1b2824386..066a332c76ad 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -22,6 +22,7 @@ > #include <linux/dax.h> > #include <linux/ksm.h> > #include <linux/pgalloc.h> > +#include <linux/backing-dev.h> > > #include <asm/tlb.h> > #include "internal.h" > @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, > hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; > hend = end & HPAGE_PMD_MASK; > > + /* > + * For file-backed VMAs, perform synchronous writeback to ensure > + * dirty folios are flushed before attempting collapse. This avoids > + * failing on the first attempt when freshly-written executable text > + * is still dirty in the page cache. > + */ > + if (!vma_is_anonymous(vma) && vma->vm_file) { > + struct address_space *mapping = vma->vm_file->f_mapping; > + > + if (mapping_can_writeback(mapping)) { > + pgoff_t pgoff_start = linear_page_index(vma, hstart); > + pgoff_t pgoff_end = linear_page_index(vma, hend); > + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT; > + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1; > + Hm, so we always do that, without any indication that there actually is something dirty there. Internally filemap_write_and_wait_range() uses something called mapping_needs_writeback(), but it also applies to the complete file, not a range. Wouldn't it be better do do that only if we detect that there is actually a dirty folio in the range? That is, if we find any dirty folio in hpage_collapse_scan_file() and we are in madvise, do that dance here and retry? -- Cheers David ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE 2025-11-20 13:35 ` David Hildenbrand (Red Hat) @ 2025-11-21 6:27 ` Garg, Shivank 0 siblings, 0 replies; 13+ messages in thread From: Garg, Shivank @ 2025-11-21 6:27 UTC (permalink / raw) To: David Hildenbrand (Red Hat), Andrew Morton, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel, Branden Moore On 11/20/2025 7:05 PM, David Hildenbrand (Red Hat) wrote: > On 11/20/25 07:50, Shivank Garg wrote: >> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable >> text sections), the pages may still be dirty from recent writes and >> cause collapse to fail with -EINVAL. This is particularly problematic >> for freshly copied executables on filesystems, where page cache folios >> remain dirty until background writeback completes. >> >> The current code in collapse_file() triggers async writeback via >> filemap_flush() and expects khugepaged to revisit the page later. >> However, MADV_COLLAPSE is a synchronous operation where userspace >> expects immediate results. >> >> Perform synchronous writeback in madvise_collapse() before attempting >> collapse to avoid failing on first attempt. >> >> Reported-by: Branden Moore <Branden.Moore@amd.com> >> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com >> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") >> Suggested-by: David Hildenbrand <david@kernel.org> >> Signed-off-by: Shivank Garg <shivankg@amd.com> >> --- >> mm/khugepaged.c | 26 ++++++++++++++++++++++++++ >> 1 file changed, 26 insertions(+) >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index 97d1b2824386..066a332c76ad 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -22,6 +22,7 @@ >> #include <linux/dax.h> >> #include <linux/ksm.h> >> #include <linux/pgalloc.h> >> +#include <linux/backing-dev.h> >> #include <asm/tlb.h> >> #include "internal.h" >> @@ -2784,6 +2785,31 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, >> hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; >> hend = end & HPAGE_PMD_MASK; >> + /* >> + * For file-backed VMAs, perform synchronous writeback to ensure >> + * dirty folios are flushed before attempting collapse. This avoids >> + * failing on the first attempt when freshly-written executable text >> + * is still dirty in the page cache. >> + */ >> + if (!vma_is_anonymous(vma) && vma->vm_file) { >> + struct address_space *mapping = vma->vm_file->f_mapping; >> + >> + if (mapping_can_writeback(mapping)) { >> + pgoff_t pgoff_start = linear_page_index(vma, hstart); >> + pgoff_t pgoff_end = linear_page_index(vma, hend); >> + loff_t lstart = (loff_t)pgoff_start << PAGE_SHIFT; >> + loff_t lend = ((loff_t)pgoff_end << PAGE_SHIFT) - 1; >> + > > Hm, so we always do that, without any indication that there actually is something dirty there. > > Internally filemap_write_and_wait_range() uses something called mapping_needs_writeback(), but it also applies to the complete file, not a range. > > Wouldn't it be better do do that only if we detect that there is actually a dirty folio in the range? > > That is, if we find any dirty folio in hpage_collapse_scan_file() and we are in madvise, do that dance here and retry? > Good point! This makes sense to me. I'll send V3 with this approach. Thanks, Shivank ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg 2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg @ 2025-11-20 6:50 ` Shivank Garg 2025-11-20 8:03 ` Dev Jain 1 sibling, 1 reply; 13+ messages in thread From: Shivank Garg @ 2025-11-20 6:50 UTC (permalink / raw) To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel, Shivank Garg When collapse_file encounters dirty or writeback pages in file-backed mappings, it currently SCAN_FAIL which maps to -EINVAL. This is misleading as EINVAL suggests invalid arguments, whereas dirty/writeback pages represent transient conditions that may resolve on retry. Introduce SCAN_PAGE_NOT_CLEAN to cover both dirty and writeback states, mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with a clear signal that retry may succeed after writeback completes, making -EAGAIN semantically correct. For khugepaged, this is harmless as it will naturally revisit the range during periodic scans after async writeback completes. Signed-off-by: Shivank Garg <shivankg@amd.com> --- include/trace/events/huge_memory.h | 3 ++- mm/khugepaged.c | 8 +++++--- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 4cde53b45a85..1caf24b951e1 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -37,7 +37,8 @@ EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \ EM( SCAN_STORE_FAILED, "store_failed") \ EM( SCAN_COPY_MC, "copy_poisoned_page") \ - EMe(SCAN_PAGE_FILLED, "page_filled") + EM( SCAN_PAGE_FILLED, "page_filled") \ + EMe(SCAN_PAGE_NOT_CLEAN, "page_not_clean") #undef EM #undef EMe diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 066a332c76ad..282b413d17e8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -59,6 +59,7 @@ enum scan_result { SCAN_STORE_FAILED, SCAN_COPY_MC, SCAN_PAGE_FILLED, + SCAN_PAGE_NOT_CLEAN, }; #define CREATE_TRACE_POINTS @@ -1968,11 +1969,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, */ xas_unlock_irq(&xas); filemap_flush(mapping); - result = SCAN_FAIL; + result = SCAN_PAGE_NOT_CLEAN; goto xa_unlocked; } else if (folio_test_writeback(folio)) { xas_unlock_irq(&xas); - result = SCAN_FAIL; + result = SCAN_PAGE_NOT_CLEAN; goto xa_unlocked; } else if (folio_trylock(folio)) { folio_get(folio); @@ -2019,7 +2020,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, * folio is dirty because it hasn't been flushed * since first write. */ - result = SCAN_FAIL; + result = SCAN_PAGE_NOT_CLEAN; goto out_unlock; } @@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r) case SCAN_PAGE_LRU: case SCAN_DEL_PAGE_LRU: case SCAN_PAGE_FILLED: + case SCAN_PAGE_NOT_CLEAN: return -EAGAIN; /* * Other: Trying again likely not to succeed / error intrinsic to -- 2.43.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg @ 2025-11-20 8:03 ` Dev Jain 2025-11-20 8:17 ` Garg, Shivank 0 siblings, 1 reply; 13+ messages in thread From: Dev Jain @ 2025-11-20 8:03 UTC (permalink / raw) To: Shivank Garg, Andrew Morton, David Hildenbrand, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel On 20/11/25 12:20 pm, Shivank Garg wrote: > When collapse_file encounters dirty or writeback pages in file-backed > mappings, it currently SCAN_FAIL which maps to -EINVAL. This is > misleading as EINVAL suggests invalid arguments, whereas dirty/writeback > pages represent transient conditions that may resolve on retry. > > Introduce SCAN_PAGE_NOT_CLEAN to cover both dirty and writeback states, > mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with > a clear signal that retry may succeed after writeback completes, making > -EAGAIN semantically correct. For khugepaged, this is harmless as it > will naturally revisit the range during periodic scans after async > writeback completes. > > Signed-off-by: Shivank Garg <shivankg@amd.com> > --- > include/trace/events/huge_memory.h | 3 ++- > mm/khugepaged.c | 8 +++++--- > 2 files changed, 7 insertions(+), 4 deletions(-) > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h > index 4cde53b45a85..1caf24b951e1 100644 > --- a/include/trace/events/huge_memory.h > +++ b/include/trace/events/huge_memory.h > @@ -37,7 +37,8 @@ > EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \ > EM( SCAN_STORE_FAILED, "store_failed") \ > EM( SCAN_COPY_MC, "copy_poisoned_page") \ > - EMe(SCAN_PAGE_FILLED, "page_filled") > + EM( SCAN_PAGE_FILLED, "page_filled") \ > + EMe(SCAN_PAGE_NOT_CLEAN, "page_not_clean") > > #undef EM > #undef EMe > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 066a332c76ad..282b413d17e8 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -59,6 +59,7 @@ enum scan_result { > SCAN_STORE_FAILED, > SCAN_COPY_MC, > SCAN_PAGE_FILLED, > + SCAN_PAGE_NOT_CLEAN, > }; > > #define CREATE_TRACE_POINTS > @@ -1968,11 +1969,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > */ > xas_unlock_irq(&xas); > filemap_flush(mapping); > - result = SCAN_FAIL; > + result = SCAN_PAGE_NOT_CLEAN; > goto xa_unlocked; > } else if (folio_test_writeback(folio)) { > xas_unlock_irq(&xas); > - result = SCAN_FAIL; > + result = SCAN_PAGE_NOT_CLEAN; > goto xa_unlocked; > } else if (folio_trylock(folio)) { > folio_get(folio); > @@ -2019,7 +2020,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > * folio is dirty because it hasn't been flushed > * since first write. > */ > - result = SCAN_FAIL; > + result = SCAN_PAGE_NOT_CLEAN; > goto out_unlock; > } > > @@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r) > case SCAN_PAGE_LRU: > case SCAN_DEL_PAGE_LRU: > case SCAN_PAGE_FILLED: > + case SCAN_PAGE_NOT_CLEAN: > return -EAGAIN; > /* > * Other: Trying again likely not to succeed / error intrinsic to SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY? Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of the folio being dirty, maybe just SCAN_PAGE_DIRTY can do. Reviewed-by: Dev Jain <dev.jain@arm.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 8:03 ` Dev Jain @ 2025-11-20 8:17 ` Garg, Shivank 2025-11-20 9:55 ` Dev Jain 2025-11-20 12:24 ` Lance Yang 0 siblings, 2 replies; 13+ messages in thread From: Garg, Shivank @ 2025-11-20 8:17 UTC (permalink / raw) To: Dev Jain, Andrew Morton, David Hildenbrand, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel On 11/20/2025 1:33 PM, Dev Jain wrote: > > On 20/11/25 12:20 pm, Shivank Garg wrote: > SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY? > Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of > the folio being dirty, maybe just SCAN_PAGE_DIRTY can do. > > Reviewed-by: Dev Jain <dev.jain@arm.com> > Thanks for the review. I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]: Dirty: Memory that is waiting to be written back to disk Writeback: Memory that is actively being written back to disk [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading for pages in the writeback state. I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long. SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable state suitable for collapse. [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt Thanks, Shivank ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 8:17 ` Garg, Shivank @ 2025-11-20 9:55 ` Dev Jain 2025-11-20 12:24 ` Lance Yang 1 sibling, 0 replies; 13+ messages in thread From: Dev Jain @ 2025-11-20 9:55 UTC (permalink / raw) To: Garg, Shivank, Andrew Morton, David Hildenbrand, Lorenzo Stoakes Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Barry Song, Lance Yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel On 20/11/25 1:47 pm, Garg, Shivank wrote: > > On 11/20/2025 1:33 PM, Dev Jain wrote: >> On 20/11/25 12:20 pm, Shivank Garg wrote: >> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY? >> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of >> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do. >> >> Reviewed-by: Dev Jain <dev.jain@arm.com> >> > Thanks for the review. > > I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]: > > Dirty: Memory that is waiting to be written back to disk > Writeback: Memory that is actively being written back to disk > > [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt > > IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading > for pages in the writeback state. > > I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long. > > SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable > state suitable for collapse. > > [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt Alright, makes sense. > > Thanks, > Shivank ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 8:17 ` Garg, Shivank 2025-11-20 9:55 ` Dev Jain @ 2025-11-20 12:24 ` Lance Yang 2025-11-20 13:29 ` David Hildenbrand (Red Hat) 1 sibling, 1 reply; 13+ messages in thread From: Lance Yang @ 2025-11-20 12:24 UTC (permalink / raw) To: Garg, Shivank, Dev Jain Cc: Zi Yan, Baolin Wang, Liam R . Howlett, David Hildenbrand, Nico Pache, Ryan Roberts, Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel On 2025/11/20 16:17, Garg, Shivank wrote: > > > On 11/20/2025 1:33 PM, Dev Jain wrote: >> >> On 20/11/25 12:20 pm, Shivank Garg wrote: > >> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY? >> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of >> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do. >> >> Reviewed-by: Dev Jain <dev.jain@arm.com> >> > Thanks for the review. > > I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]: > > Dirty: Memory that is waiting to be written back to disk > Writeback: Memory that is actively being written back to disk > > [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt > > IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading > for pages in the writeback state. > > I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long. Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK is too verbose, how about SCAN_PAGE_DIRTY_WB? It keeps the specificity without the length, and is arguably more descriptive than NOT_CLEAN ;) That said, LGTM. Reviewed-by: Lance Yang <lance.yang@linux.dev> > > SCAN_PAGE_NOT_CLEAN covers both states that indicate the page is not in a clean/stable > state suitable for collapse. > > [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt > > Thanks, > Shivank ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 12:24 ` Lance Yang @ 2025-11-20 13:29 ` David Hildenbrand (Red Hat) 2025-11-21 6:15 ` Garg, Shivank 0 siblings, 1 reply; 13+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-20 13:29 UTC (permalink / raw) To: Lance Yang, Garg, Shivank, Dev Jain Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel On 11/20/25 13:24, Lance Yang wrote: > > > On 2025/11/20 16:17, Garg, Shivank wrote: >> >> >> On 11/20/2025 1:33 PM, Dev Jain wrote: >>> >>> On 20/11/25 12:20 pm, Shivank Garg wrote: >> >>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY? >>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of >>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do. >>> >>> Reviewed-by: Dev Jain <dev.jain@arm.com> >>> >> Thanks for the review. >> >> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]: >> >> Dirty: Memory that is waiting to be written back to disk >> Writeback: Memory that is actively being written back to disk >> >> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt >> >> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading >> for pages in the writeback state. >> >> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long. > > Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK I would prefer that here. -- Cheers David ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN 2025-11-20 13:29 ` David Hildenbrand (Red Hat) @ 2025-11-21 6:15 ` Garg, Shivank 0 siblings, 0 replies; 13+ messages in thread From: Garg, Shivank @ 2025-11-21 6:15 UTC (permalink / raw) To: David Hildenbrand (Red Hat), Lance Yang, Dev Jain Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Barry Song, Andrew Morton, Steven Rostedt, Lorenzo Stoakes, Masami Hiramatsu, Mathieu Desnoyers, Zach O'Keefe, linux-mm, linux-kernel, linux-trace-kernel On 11/20/2025 6:59 PM, David Hildenbrand (Red Hat) wrote: > On 11/20/25 13:24, Lance Yang wrote: >> >> >> On 2025/11/20 16:17, Garg, Shivank wrote: >>> >>> >>> On 11/20/2025 1:33 PM, Dev Jain wrote: >>>> >>>> On 20/11/25 12:20 pm, Shivank Garg wrote: >>> >>>> SCAN_PAGE_NOT_CLEAN is confusing - NOT_CLEAN literally means dirty, so why not SCAN_PAGE_DIRTY? >>>> Or SCAN_PAGE_DIRTY_OR_UNDER_WRITEBACK? Since folio_test_writeback() is true as a result of >>>> the folio being dirty, maybe just SCAN_PAGE_DIRTY can do. >>>> >>>> Reviewed-by: Dev Jain <dev.jain@arm.com> >>>> >>> Thanks for the review. >>> >>> I chose not to use SCAN_PAGE_DIRTY because dirty and writeback have different meanings[1]: >>> >>> Dirty: Memory that is waiting to be written back to disk >>> Writeback: Memory that is actively being written back to disk >>> >>> [1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt >>> >>> IIUC, a page under writeback is no longer dirty, so using SCAN_PAGE_DIRTY would be misleading >>> for pages in the writeback state. >>> >>> I considered SCAN_PAGE_DIRTY_OR_WRITEBACK initially but felt it was too long. >> >> Nit: If SCAN_PAGE_DIRTY_OR_WRITEBACK > > I would prefer that here. > I agree on this. If the consensus is SCAN_PAGE_DIRTY_OR_WRITEBACK, I'll use it in v3. Thanks, Shivank ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-11-21 6:28 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-11-20 6:50 [PATCH V2 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE Shivank Garg 2025-11-20 6:50 ` [PATCH V2 1/2] mm/khugepaged: do synchronous writeback " Shivank Garg 2025-11-20 13:01 ` Lance Yang 2025-11-21 6:27 ` Garg, Shivank 2025-11-20 13:35 ` David Hildenbrand (Red Hat) 2025-11-21 6:27 ` Garg, Shivank 2025-11-20 6:50 ` [PATCH V2 2/2] mm/khugepaged: map dirty/writeback pages failures to EAGAIN Shivank Garg 2025-11-20 8:03 ` Dev Jain 2025-11-20 8:17 ` Garg, Shivank 2025-11-20 9:55 ` Dev Jain 2025-11-20 12:24 ` Lance Yang 2025-11-20 13:29 ` David Hildenbrand (Red Hat) 2025-11-21 6:15 ` Garg, Shivank
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox