madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages

* madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
@ 2025-11-06 12:16 Garg, Shivank
  2025-11-06 12:55 ` Lance Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Garg, Shivank @ 2025-11-06 12:16 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
	Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe
  Cc: linux-mm, linux-kernel, shivankg

Hi All,

I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
current behavior and improvements.

Problem:
When attempting to collapse read-only file-backed TEXT sections into THPs
using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
are marked dirty.
madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 with errno = EINVAL (22)

Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise 
attempt triggers filemap_flush() which initiates async writeback of the dirty folios.

Root Cause:
The failure occurs in mm/khugepaged.c:collapse_file():
} else if (folio_test_dirty(folio)) {
    /*
     * khugepaged only works on read-only fd,
     * so this page is dirty because it hasn't
     * been flushed since first write. There
     * won't be new dirty pages.
     *
     * Trigger async flush here and hope the
     * writeback is done when khugepaged
     * revisits this page.
     */
    xas_unlock_irq(&xas);
    filemap_flush(mapping);
    result = SCAN_FAIL;
    goto xa_unlocked;
}

Why are the text pages dirty?
It initially seemed unusual for a read-only text section to be marked as dirty, but
this was actually confirmed by /proc/pid/smaps.

55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
Size:              16384 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                 256 kB
Pss:                 256 kB
Pss_Dirty:           256 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:       256 kB

/proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
This may be due to relocations applied by the dynamic linker during program loading.
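
The smaps check above can also be done from inside the process. The following is
a rough sketch under stated assumptions: the function names (smaps_field_kb,
exec_private_dirty_kb) are made up for this example, and the parser only relies
on the "start-end perms ..." header lines and "Field:  N kB" lines visible in the
excerpt above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one smaps field line such as "Private_Dirty:       256 kB".
 * Returns the value in kB, or -1 if the line is not the requested field.
 */
static long smaps_field_kb(const char *line, const char *field)
{
	size_t n;
	long kb;

	assert(line != NULL && field != NULL);
	n = strlen(field);
	if (strncmp(line, field, n) != 0 || line[n] != ':')
		return -1;
	if (sscanf(line + n + 1, "%ld kB", &kb) != 1)
		return -1;
	return kb;
}

/*
 * Sum Private_Dirty (in kB) over all executable mappings listed in an
 * smaps-formatted file, e.g. "/proc/self/smaps".
 * Returns -1 if the file cannot be opened.
 */
static long exec_private_dirty_kb(const char *smaps_path)
{
	FILE *f = fopen(smaps_path, "r");
	char line[1024], perms[5];
	unsigned long start, end;
	long total = 0, kb;
	int in_exec = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* A mapping header looks like "55bc90200000-55bc91200000 r-xp ...". */
		if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3)
			in_exec = (perms[2] == 'x');
		else if (in_exec &&
			 (kb = smaps_field_kb(line, "Private_Dirty")) >= 0)
			total += kb;
	}
	fclose(f);
	return total;
}
```

Calling exec_private_dirty_kb("/proc/self/smaps") before madvise(MADV_COLLAPSE)
gives a quick indication of whether the dirty-page path is likely to be hit.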

Reproduction using XFS/EXT4:

1. Compile a test binary that calls madvise(MADV_COLLAPSE), ensuring the TEXT LOAD segment is
   2MB-aligned and sized to a multiple of 2MB:
   Type           Offset   VirtAddr           PhysAddr           FileSiz   MemSiz    Flg Align
   LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000

2. Create and mount the XFS/EXT4 fs:
   dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
   losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
   mkfs.xfs -f /dev/loop0
   mkdir -p /mnt/xfs-mnt
   mount /dev/loop0 /mnt/xfs-mnt
3. Copy the binaries to /mnt/xfs-mnt and execute.
4. The first run returns -EINVAL; subsequent runs succeed. (100% reproducible)
5. To reproduce again, reboot/kexec and repeat from step 2.

Workaround:
1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
	int fd = open("/proc/self/exe", O_RDONLY);
	if (fd >= 0) {
		fsync(fd);
		close(fd);
	}
	// Now madvise(MADV_COLLAPSE) succeeds
2. Alternatively, retrying madvise(MADV_COLLAPSE) after an EINVAL failure also works.
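
Both workarounds can be combined in a small userspace wrapper. This is only a
sketch: flush_file, collapse_errno_is_retryable and collapse_with_retry are
hypothetical names, the 10 ms delay is an arbitrary choice, and treating EINVAL
as retryable is specific to the behavior described in this report (EINVAL can
also mean genuinely invalid arguments, so the retry count is bounded).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* uapi value; only needed with older libc headers */
#endif

/* Workaround 1: flush dirty page-cache pages of a file. Returns 0 or -1. */
static int flush_file(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = fsync(fd);
	close(fd);
	return ret;
}

/* EINVAL is treated as transient only because of the case reported here. */
static int collapse_errno_is_retryable(int err)
{
	return err == EINVAL || err == EAGAIN;
}

/*
 * Workaround 2: retry the collapse a bounded number of times.
 * Returns 0 on success, otherwise the errno of the last failed attempt.
 */
static int collapse_with_retry(void *addr, size_t len, int max_tries)
{
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int err = EINVAL;

	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		if (madvise(addr, len, MADV_COLLAPSE) == 0)
			return 0;
		err = errno;
		if (!collapse_errno_is_retryable(err))
			break;
		nanosleep(&delay, NULL); /* give async writeback time to finish */
	}
	return err;
}
```

A typical sequence would be flush_file("/proc/self/exe") followed by
collapse_with_retry(aligned_start, aligned_size, 3).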

Problems with Current Behavior:
1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
   rather than a transient condition that could succeed on retry.

2. Non-Transparent Handling: Users are unaware that they need to flush dirty pages manually. The
   current madvise_collapse path assumes the caller is khugepaged (as per the comment in the code
   snippet), which will revisit the page. However, when called via madvise(MADV_COLLAPSE), the
   userspace program typically doesn't retry, making the async flush ineffective. Should we
   differentiate between madvise and khugepaged behavior for MADV_COLLAPSE?

Would appreciate thoughts on the best approach to address this issue.

Thanks,
Shivank


* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 12:16 madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages Garg, Shivank
@ 2025-11-06 12:55 ` Lance Yang
  2025-11-06 13:03   ` Nico Pache
  2025-11-06 16:32 ` Ryan Roberts
  2025-11-06 20:32 ` Yang Shi
  2 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-11-06 12:55 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: linux-mm, Lorenzo Stoakes, Nico Pache, Ryan Roberts,
	linux-kernel, Zi Yan, Dev Jain, Baolin Wang, Jann Horn,
	David Hildenbrand, Liam R. Howlett, Barry Song, Andrew Morton,
	zokeefe, Vlastimil Babka



On 2025/11/6 20:16, Garg, Shivank wrote:
> Hi All,

Hi Shivank,

Good catch and a really clear analysis - thanks!
> 
> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> current behavior and improvements.
> 
> Problem:
> When attempting to collapse read-only file-backed TEXT sections into THPs
> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> are marked dirty.
> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> 
> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> 
> Root Cause:
> The failure occurs in mm/khugepaged.c:collapse_file():
> } else if (folio_test_dirty(folio)) {
>      /*
>       * khugepaged only works on read-only fd,
>       * so this page is dirty because it hasn't
>       * been flushed since first write. There
>       * won't be new dirty pages.
>       *
>       * Trigger async flush here and hope the
>       * writeback is done when khugepaged
>       * revisits this page.
>       */
>      xas_unlock_irq(&xas);
>      filemap_flush(mapping);
>      result = SCAN_FAIL;
>      goto xa_unlocked;
> }
> 
> Why the text pages are dirty?
> It initially seemed unusual for a read-only text section to be marked as dirty, but
> this was actually confirmed by /proc/pid/smaps.
> 
> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> Size:              16384 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:                 256 kB
> Pss:                 256 kB
> Pss_Dirty:           256 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:       256 kB
> 
> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> This may be due to dynamic linker and relocations that occurred during program loading.
> 
> Reproduction using XFS/EXT4:
> 
> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
>     2MB-aligned and sized to a multiple of 2MB.
>    Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
> 
> 2. Create and mount the XFS/EXT4 fs:
>     dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
>     losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
>     mkfs.xfs -f /dev/loop0
>     mkdir -p /mnt/xfs-mnt
>     mount /dev/loop0 /mnt/xfs-mnt
> 3. Copy the binaries to /mnt/xfs-mnt and execute.
> 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> 5. To reproduce again; reboot/kexec and repeat from step 2.
> 
> Workaround:
> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> 	int fd = open("/proc/self/exe", O_RDONLY);
> 	if (fd >= 0) {
> 		fsync(fd);
> 		close(fd);
> 	}
> 	// Now madvise(MADV_COLLAPSE) succeeds
> 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
> 
> Problems with Current Behavior:
> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
>     rather than a transient condition that could succeed on retry.
> 
> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
>     madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
>     the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
>     retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
>     behavior for MADV_COLLAPSE?
> 
> Would appreciate thoughts on the best approach to address this issue.

Just throwing out a couple of ideas ...

We could just switch the return code to EAGAIN in the MADV_COLLAPSE path.
At least that gives the right hint that retrying is an option ;)

Or, what if we just handle it inside the syscall? When we hit a dirty page,
we wait for the writeback to finish and then try again right away. The call
might be a little slower, but MADV_COLLAPSE is best effort, right? That seems
worth the trouble ...

Cheers,
Lance



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 12:55 ` Lance Yang
@ 2025-11-06 13:03   ` Nico Pache
  0 siblings, 0 replies; 16+ messages in thread
From: Nico Pache @ 2025-11-06 13:03 UTC (permalink / raw)
  To: Lance Yang
  Cc: Garg, Shivank, linux-mm, Lorenzo Stoakes, Ryan Roberts,
	linux-kernel, Zi Yan, Dev Jain, Baolin Wang, Jann Horn,
	David Hildenbrand, Liam R. Howlett, Barry Song, Andrew Morton,
	zokeefe, Vlastimil Babka

On Thu, Nov 6, 2025 at 5:55 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2025/11/6 20:16, Garg, Shivank wrote:
> > Hi All,
>
> Hi Shivank,
>
> Good catch and a really clear analysis - thanks!
+1!
> >
> > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > current behavior and improvements.
> >
> > Problem:
> > When attempting to collapse read-only file-backed TEXT sections into THPs
> > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > are marked dirty.
> > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> >
> > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> >
> > Root Cause:
> > The failure occurs in mm/khugepaged.c:collapse_file():
> > } else if (folio_test_dirty(folio)) {
> >      /*
> >       * khugepaged only works on read-only fd,
> >       * so this page is dirty because it hasn't
> >       * been flushed since first write. There
> >       * won't be new dirty pages.
> >       *
> >       * Trigger async flush here and hope the
> >       * writeback is done when khugepaged
> >       * revisits this page.
> >       */
> >      xas_unlock_irq(&xas);
> >      filemap_flush(mapping);
> >      result = SCAN_FAIL;
> >      goto xa_unlocked;
> > }
> >
> > Why the text pages are dirty?
> > It initially seemed unusual for a read-only text section to be marked as dirty, but
> > this was actually confirmed by /proc/pid/smaps.
> >
> > 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> > Size:              16384 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> > Rss:                 256 kB
> > Pss:                 256 kB
> > Pss_Dirty:           256 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:         0 kB
> > Private_Dirty:       256 kB
> >
> > /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> > This may be due to dynamic linker and relocations that occurred during program loading.
> >
> > Reproduction using XFS/EXT4:
> >
> > 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
> >     2MB-aligned and sized to a multiple of 2MB.
> >    Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> > LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
> >
> > 2. Create and mount the XFS/EXT4 fs:
> >     dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
> >     losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
> >     mkfs.xfs -f /dev/loop0
> >     mkdir -p /mnt/xfs-mnt
> >     mount /dev/loop0 /mnt/xfs-mnt
> > 3. Copy the binaries to /mnt/xfs-mnt and execute.
> > 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> > 5. To reproduce again; reboot/kexec and repeat from step 2.
> >
> > Workaround:
> > 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> >       int fd = open("/proc/self/exe", O_RDONLY);
> >       if (fd >= 0) {
> >               fsync(fd);
> >               close(fd);
> >       }
> >       // Now madvise(MADV_COLLAPSE) succeeds
> > 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
> >
> > Problems with Current Behavior:
> > 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
> >     rather than a transient condition that could succeed on retry.
> >
> > 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
> >     madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
> >     the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
> >     retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
> >     behavior for MADV_COLLAPSE?
> >
> > Would appreciate thoughts on the best approach to address this issue.
>
> Just throwing out a couple of ideas ...
>
> We could just switch the return code to EAGAIN in the MADV_COLLAPSE
> path. At least that
> gives the right hint that retrying is an option ;)

Hey! I agree with Lance here, it seems the solution would be to return
something other than SCAN_FAIL in collapse_file(), then in
madvise_collapse_errno() catch this error and return EAGAIN. We could
use SCAN_PAGE_COUNT, which will cause an EAGAIN, or we could create a
new result enum.
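
The idea can be modelled in plain userspace C. This is a toy model, not kernel
code: the toy_* names are invented for illustration, TOY_SCAN_PAGE_DIRTY stands
in for the "new result enum" option, and the only facts taken from the thread
are that SCAN_FAIL currently ends up as EINVAL and that SCAN_PAGE_COUNT ends up
as EAGAIN.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, heavily simplified stand-in for the kernel's scan results. */
enum toy_scan_result {
	TOY_SCAN_SUCCEED,
	TOY_SCAN_FAIL,       /* generic failure; what the dirty-folio path returns today */
	TOY_SCAN_PAGE_COUNT, /* already reported as EAGAIN, per the thread */
	TOY_SCAN_PAGE_DIRTY, /* hypothetical new result: dirty folio, flush started */
};

/* Toy version of the result-to-errno translation for MADV_COLLAPSE. */
static int toy_collapse_errno(enum toy_scan_result r)
{
	switch (r) {
	case TOY_SCAN_SUCCEED:
		return 0;
	case TOY_SCAN_PAGE_COUNT:
	case TOY_SCAN_PAGE_DIRTY:
		return -EAGAIN; /* transient: caller may retry */
	default:
		return -EINVAL; /* everything else stays a hard failure */
	}
}

/* Documents the behavior change: only the dirty case moves to EAGAIN. */
static void toy_selfcheck(void)
{
	assert(toy_collapse_errno(TOY_SCAN_FAIL) == -EINVAL);
	assert(toy_collapse_errno(TOY_SCAN_PAGE_DIRTY) == -EAGAIN);
}
```

With a dedicated result, the dirty-folio branch would report the transient
condition without changing how other SCAN_FAIL sites are seen by userspace.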

Cheers,
-- Nico
>
> Or, what if we just handle it inside the syscall? When we hit a dirty
> page, we wait for
> the writeback to finish and then try again right away. The call might be
> a little slower,
> but MADV_COLLAPSE is best effort, right? That seems worth the trouble ...
>
> Cheers,
> Lance
>




* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 12:16 madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages Garg, Shivank
  2025-11-06 12:55 ` Lance Yang
@ 2025-11-06 16:32 ` Ryan Roberts
  2025-11-06 16:55   ` Liam R. Howlett
  2025-11-06 20:32 ` Yang Shi
  2 siblings, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2025-11-06 16:32 UTC (permalink / raw)
  To: Garg, Shivank, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe
  Cc: linux-mm, linux-kernel

On 06/11/2025 12:16, Garg, Shivank wrote:
> Hi All,
> 
> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> current behavior and improvements.
> 
> Problem:
> When attempting to collapse read-only file-backed TEXT sections into THPs
> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> are marked dirty.
> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> 
> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise 
> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> 
> Root Cause:
> The failure occurs in mm/khugepaged.c:collapse_file():
> } else if (folio_test_dirty(folio)) {
>     /*
>      * khugepaged only works on read-only fd,
>      * so this page is dirty because it hasn't
>      * been flushed since first write. There
>      * won't be new dirty pages.
>      *
>      * Trigger async flush here and hope the
>      * writeback is done when khugepaged
>      * revisits this page.
>      */
>     xas_unlock_irq(&xas);
>     filemap_flush(mapping);
>     result = SCAN_FAIL;
>     goto xa_unlocked;
> }
> 
> Why the text pages are dirty?

This is the real question to answer, I think...

What architecture are you running on?


> It initially seemed unusual for a read-only text section to be marked as dirty, but
> this was actually confirmed by /proc/pid/smaps.
> 
> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> Size:              16384 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:                 256 kB
> Pss:                 256 kB
> Pss_Dirty:           256 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:       256 kB
> 
> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> This may be due to dynamic linker and relocations that occurred during program loading.

On arm64 at least, I wouldn't expect the text to be modified. Relocations should
be handled in data. But given you have private dirty pages here, they must have
been cow'ed and are therefore anonymous? In which case, where is writeback
actually going?

> 
> Reproduction using XFS/EXT4:
> 
> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
>    2MB-aligned and sized to a multiple of 2MB. 
>   Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
> 
> 2. Create and mount the XFS/EXT4 fs:
>    dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
>    losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
>    mkfs.xfs -f /dev/loop0
>    mkdir -p /mnt/xfs-mnt
>    mount /dev/loop0 /mnt/xfs-mnt
> 3. Copy the binaries to /mnt/xfs-mnt and execute.
> 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> 5. To reproduce again; reboot/kexec and repeat from step 2. 
> 
> Workaround:
> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> 	int fd = open("/proc/self/exe", O_RDONLY);
> 	if (fd >= 0) {
> 		fsync(fd);
> 		close(fd);
> 	}
> 	// Now madvise(MADV_COLLAPSE) succeeds
> 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
> 
> Problems with Current Behavior:
> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
>    rather than a transient condition that could succeed on retry.
> 
> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
>    madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
>    the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
>    retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
>    behavior for MADV_COLLAPSE?
> 
> Would appreciate thoughts on the best approach to address this issue.
> 
> Thanks,
> Shivank




* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 16:32 ` Ryan Roberts
@ 2025-11-06 16:55   ` Liam R. Howlett
  2025-11-06 17:17     ` Lorenzo Stoakes
  0 siblings, 1 reply; 16+ messages in thread
From: Liam R. Howlett @ 2025-11-06 16:55 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Garg, Shivank, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Zi Yan, Baolin Wang, Nico Pache, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Jann Horn, zokeefe, linux-mm,
	linux-kernel

* Ryan Roberts <ryan.roberts@arm.com> [251106 11:33]:
> On 06/11/2025 12:16, Garg, Shivank wrote:
> > Hi All,
> > 
> > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > current behavior and improvements.
> > 
> > Problem:
> > When attempting to collapse read-only file-backed TEXT sections into THPs
> > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > are marked dirty.
> > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> > 
> > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise 
> > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> > 
> > Root Cause:
> > The failure occurs in mm/khugepaged.c:collapse_file():
> > } else if (folio_test_dirty(folio)) {
> >     /*
> >      * khugepaged only works on read-only fd,
> >      * so this page is dirty because it hasn't
> >      * been flushed since first write. There
> >      * won't be new dirty pages.
> >      *
> >      * Trigger async flush here and hope the
> >      * writeback is done when khugepaged
> >      * revisits this page.
> >      */
> >     xas_unlock_irq(&xas);
> >     filemap_flush(mapping);
> >     result = SCAN_FAIL;
> >     goto xa_unlocked;
> > }
> > 
> > Why the text pages are dirty?
> 
> This is the real question to to answer, I think...

Agree with Ryan here, let's stop things from being marked dirty if they
are not.

> 
> What architecture are you running on?
> 
> 
> > It initially seemed unusual for a read-only text section to be marked as dirty, but
> > this was actually confirmed by /proc/pid/smaps.
> > 
> > 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> > Size:              16384 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> > Rss:                 256 kB
> > Pss:                 256 kB
> > Pss_Dirty:           256 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:         0 kB
> > Private_Dirty:       256 kB
> > 
> > /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> > This may be due to dynamic linker and relocations that occurred during program loading.
> 
> On arm64 at least, I wouldn't expect the text to be modified. Relocations should
> be handled in data. But given you have private dirty pages here, they must have
> been cow'ed and are therefore anonymous? In which case, where is writeback
> actually going?
> 
> > 
> > Reproduction using XFS/EXT4:
> > 
> > 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
> >    2MB-aligned and sized to a multiple of 2MB. 
> >   Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> > LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
> > 
> > 2. Create and mount the XFS/EXT4 fs:
> >    dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
> >    losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
> >    mkfs.xfs -f /dev/loop0
> >    mkdir -p /mnt/xfs-mnt
> >    mount /dev/loop0 /mnt/xfs-mnt
> > 3. Copy the binaries to /mnt/xfs-mnt and execute.
> > 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> > 5. To reproduce again; reboot/kexec and repeat from step 2. 
> > 
> > Workaround:
> > 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> > 	int fd = open("/proc/self/exe", O_RDONLY);
> > 	if (fd >= 0) {
> > 		fsync(fd);
> > 		close(fd);
> > 	}
> > 	// Now madvise(MADV_COLLAPSE) succeeds
> > 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
> > 
> > Problems with Current Behavior:
> > 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
> >    rather than a transient condition that could succeed on retry.

This is also an issue though.  Reading the documentation on my system,
EINVAL with collapse has two meanings:
        EINVAL addr is not page-aligned or length is negative.
        EINVAL advice is not a valid.
Neither are right here.

EAGAIN seems to make sense, but the documentation would need to be
changed too:
        EAGAIN A kernel resource was temporarily unavailable.

> > 
> > 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
> >    madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
> >    the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
> >    retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
> >    behavior for MADV_COLLAPSE?

The collapse documentation states that it works on the existing state of
the system memory, so it is doing what it says but the EINVAL return on
dirty pages is not documented, afaict?

> > 
> > Would appreciate thoughts on the best approach to address this issue.
> > 
> > Thanks,
> > Shivank
> 



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 16:55   ` Liam R. Howlett
@ 2025-11-06 17:17     ` Lorenzo Stoakes
  2025-11-06 21:05       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-11-06 17:17 UTC (permalink / raw)
  To: Liam R. Howlett, Ryan Roberts, Garg, Shivank, Andrew Morton,
	David Hildenbrand, Zi Yan, Baolin Wang, Nico Pache, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe,
	linux-mm, linux-kernel

On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
> * Ryan Roberts <ryan.roberts@arm.com> [251106 11:33]:
> > On 06/11/2025 12:16, Garg, Shivank wrote:
> > > Hi All,
> > >
> > > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > > current behavior and improvements.
> > >
> > > Problem:
> > > When attempting to collapse read-only file-backed TEXT sections into THPs
> > > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > > are marked dirty.
> > > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> > >
> > > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> > >
> > > Root Cause:
> > > The failure occurs in mm/khugepaged.c:collapse_file():
> > > } else if (folio_test_dirty(folio)) {
> > >     /*
> > >      * khugepaged only works on read-only fd,
> > >      * so this page is dirty because it hasn't
> > >      * been flushed since first write. There
> > >      * won't be new dirty pages.
> > >      *
> > >      * Trigger async flush here and hope the
> > >      * writeback is done when khugepaged
> > >      * revisits this page.
> > >      */
> > >     xas_unlock_irq(&xas);
> > >     filemap_flush(mapping);
> > >     result = SCAN_FAIL;
> > >     goto xa_unlocked;
> > > }
> > >
> > > Why the text pages are dirty?
> >
> > This is the real question to to answer, I think...
>
> Agree with Ryan here, let's stop things from being marked dirty if they
> are not.

Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
mappings.

collapse_single_pmd()
-> collapse_scan_file() if not vma_is_anonymous() (it won't be)
-> collapse_file()
-> the snippet above.

But that could be running on an anon folio...

Yup, given it's CONFIG_READ_ONLY_THP_FOR_FS, that is strange. We are confounding
expectations here, surely?

Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
but then collapse_file() goes into the snippet above and gets very confused.

Do we need to add a folio_test_anon() here?

Unless I'm missing something... (very possible, am only glancing over the code
here)

>
> >
> > What architecture are you running on?
> >
> >
> > > It initially seemed unusual for a read-only text section to be marked as dirty, but
> > > this was actually confirmed by /proc/pid/smaps.
> > >
> > > 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> > > Size:              16384 kB
> > > KernelPageSize:        4 kB
> > > MMUPageSize:           4 kB
> > > Rss:                 256 kB
> > > Pss:                 256 kB
> > > Pss_Dirty:           256 kB
> > > Shared_Clean:          0 kB
> > > Shared_Dirty:          0 kB
> > > Private_Clean:         0 kB
> > > Private_Dirty:       256 kB
> > >
> > > /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> > > This may be due to dynamic linker and relocations that occurred during program loading.
> >
> > On arm64 at least, I wouldn't expect the text to be modified. Relocations should
> > be handled in data. But given you have private dirty pages here, they must have
> > been cow'ed and are therefore anonymous? In which case, where is writeback
> > actually going?

Well, there won't be any, right? I mean, it's fairly normal to modify these
MAP_PRIVATE mappings for relocations etc., isn't it?

You clipped the Anonymous line here, could you share it?

> >
> > >
> > > Reproduction using XFS/EXT4:
> > >
> > > 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
> > >    2MB-aligned and sized to a multiple of 2MB.
> > >   Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> > > LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
> > >
> > > 2. Create and mount the XFS/EXT4 fs:
> > >    dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
> > >    losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
> > >    mkfs.xfs -f /dev/loop0
> > >    mkdir -p /mnt/xfs-mnt
> > >    mount /dev/loop0 /mnt/xfs-mnt
> > > 3. Copy the binaries to /mnt/xfs-mnt and execute.
> > > 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> > > 5. To reproduce again; reboot/kexec and repeat from step 2.
> > >
> > > Workaround:
> > > 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> > > 	int fd = open("/proc/self/exe", O_RDONLY);
> > > 	if (fd >= 0) {
> > > 		fsync(fd);
> > > 		close(fd);
> > > 	}

Are you literally madvise()'ing the text portion of the executable?

It's strange this would make a difference in that case hmm...

> > > 	// Now madvise(MADV_COLLAPSE) succeeds
> > > 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.

Hmm that's strange.

> > >
> > > Problems with Current Behavior:
> > > 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
> > >    rather than a transient condition that could succeed on retry.
>
> This is also an issue though.  Reading the documentation on my system,
> EINVAL with collapse has two meanings:
>         EINVAL addr is not page-aligned or length is negative.
>         EINVAL advice is not a valid.
> Neither are right here.
>
> EAGAIN seems to make sense, but the documentation would need to be
> changed too:
>         EAGAIN A kernel resource was temporarily unavailable.

I think these documented error codes are a total fantasy anyway in general for
all system calls, and it'd be silly to try to list every single possible failure
case in the man page. I really wish we didn't even try, but there are horrible
inconsistencies and missing entries for _tonnes_ of system calls.

So not sure if it's worth updating.

But obviously it'd be good to have more information rather than less...

>
> > >
> > > 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
> > >    madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
> > >    the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically doesn't
> > >    retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
> > >    behavior for MADV_COLLAPSE?
>
> The collapse documentation states that it works on the existing state of
> the system memory, so it is doing what it says but the EINVAL return on
> dirty pages is not documented, afaict?

It'd be good to document that this will fail if there are dirty pages yes.

>
> > >
> > > Would appreciate thoughts on the best approach to address this issue.
> > >
> > > Thanks,
> > > Shivank
> >

Cheers, Lorenzo



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 17:17     ` Lorenzo Stoakes
@ 2025-11-06 21:05       ` David Hildenbrand (Red Hat)
  2025-11-07  8:51         ` Garg, Shivank
  2025-11-07 10:09         ` Lorenzo Stoakes
  0 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-06 21:05 UTC (permalink / raw)
  To: Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Garg, Shivank,
	Andrew Morton, Zi Yan, Baolin Wang, Nico Pache, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe,
	linux-mm, linux-kernel

On 06.11.25 18:17, Lorenzo Stoakes wrote:
> On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
>> * Ryan Roberts <ryan.roberts@arm.com> [251106 11:33]:
>>> On 06/11/2025 12:16, Garg, Shivank wrote:
>>>> Hi All,
>>>>
>>>> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
>>>> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
>>>> current behavior and improvements.
>>>>
>>>> Problem:
>>>> When attempting to collapse read-only file-backed TEXT sections into THPs
>>>> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
>>>> are marked dirty.
>>>> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
>>>>
>>>> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
>>>> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>>>>
>>>> Root Cause:
>>>> The failure occurs in mm/khugepaged.c:collapse_file():
>>>> } else if (folio_test_dirty(folio)) {
>>>>      /*
>>>>       * khugepaged only works on read-only fd,
>>>>       * so this page is dirty because it hasn't
>>>>       * been flushed since first write. There
>>>>       * won't be new dirty pages.
>>>>       *
>>>>       * Trigger async flush here and hope the
>>>>       * writeback is done when khugepaged
>>>>       * revisits this page.
>>>>       */
>>>>      xas_unlock_irq(&xas);
>>>>      filemap_flush(mapping);
>>>>      result = SCAN_FAIL;
>>>>      goto xa_unlocked;
>>>> }
>>>>
>>>> Why the text pages are dirty?
>>>
>>> This is the real question to answer, I think...
>>
>> Agree with Ryan here, let's stop things from being marked dirty if they
>> are not.
> 
> Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
> mappings.
> 
> collapse_single_pmd()
> -> collapse_scan_file() if not vma_is_anonymous() (it won't be)
> -> collapse_file()
> -> the snippet above.
> 
> But that could be running on an anon folio...
> 
> Yup given it's CONFIG_READ_ONLY_THP_FOR_FS that is strange. We are confounding
> expectations here surely?
> 
> Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
> but then collapse_file() goes into the snippet above and gets very confused.
> 
> Do we need to add a folio_test_anon() here?
> 
> Unless I'm missing something... (very possible, am only glancing over the code
> here)

collapse_file() operates exclusively on the pagecache.

I think we only start working on the actual page tables when calling
retract_page_tables().

In there, we have this code, when iterating over page tables belonging
to the mapping:

		/*
		 * The lock of new_folio is still held, we will be blocked in
		 * the page fault path, which prevents the pte entries from
		 * being set again. So even though the old empty PTE page may be
		 * concurrently freed and a new PTE page is filled into the pmd
		 * entry, it is still empty and can be removed.
		 *
		 * So here we only need to recheck if the state of pmd entry
		 * still meets our requirements, rather than checking pmd_same()
		 * like elsewhere.
		 */
		if (check_pmd_state(pmd) != SCAN_SUCCEED)
			goto drop_pml;
		ptl = pte_lockptr(mm, pmd);
		if (ptl != pml)
			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

		/*
		 * Huge page lock is still held, so normally the page table
		 * must remain empty; and we have already skipped anon_vma
		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
		 * held, it is still possible for a racing userfaultfd_ioctl()
		 * to have inserted ptes or markers.  Now that we hold ptlock,
		 * repeating the anon_vma check protects from one category,
		 * and repeating the userfaultfd_wp() check from another.
		 */
		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
			pmdp_get_lockless_sync();
			success = true;
		}

Given !vma->anon_vma, we cannot have anon folios in there.

Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.

Given that all folios in the range we are collapsing were unmapped, we cannot have
them mapped there.

So the conclusion is that the page table must be empty and can be removed.


Could guard markers be in there?



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 21:05       ` David Hildenbrand (Red Hat)
@ 2025-11-07  8:51         ` Garg, Shivank
  2025-11-07  9:12           ` David Hildenbrand (Red Hat)
  2025-11-07 10:09         ` Lorenzo Stoakes
  1 sibling, 1 reply; 16+ messages in thread
From: Garg, Shivank @ 2025-11-07  8:51 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Andrew Morton,
	Zi Yan, Baolin Wang, Nico Pache, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Jann Horn, zokeefe, linux-mm,
	linux-kernel



On 11/7/2025 2:35 AM, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
>> On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
>>> * Ryan Roberts <ryan.roberts@arm.com> [251106 11:33]:
>>>> On 06/11/2025 12:16, Garg, Shivank wrote:
>>>>> Hi All,


Hi all,

Thank you for the quick responses and suggestions!
Information asked in this thread:
1. Architecture: X86_64

2. I want to emphasize that the error occurs specifically on a fresh mount after copying the binary.
   Binary can either be freshly compiled or previously compiled. The key factor is the fresh
   mount and copy operation.

3. For workaround:
   I'm calling fsync(fd) from inside the executable before madvise().
   Alternatively, I just found that running sync from the shell after copying the binary
   also works, as it clears the Private_Dirty pages shown in smaps.

4. readelf --wide --segments large_binary_thp_s_withoutfsync
Elf file type is DYN (Position-Independent Executable file)
Entry point 0x4012e0
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0002d8 0x0002d8 R   0x8
  INTERP         0x000318 0x0000000000000318 0x0000000000000318 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x24aa38 0x24aa38 R   0x1000
  LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
  LOAD           0x1400000 0x0000000001400000 0x0000000001400000 0x53c750 0x53c750 R   0x1000
  LOAD           0x193cd10 0x000000000193dd10 0x000000000193dd10 0x0c3810 0x0c3820 RW  0x1000
  DYNAMIC        0x193cd28 0x000000000193dd28 0x000000000193dd28 0x0001f0 0x0001f0 RW  0x8
  NOTE           0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R   0x8
  NOTE           0x000368 0x0000000000000368 0x0000000000000368 0x000044 0x000044 R   0x4
  GNU_PROPERTY   0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R   0x8
  GNU_EH_FRAME   0x156bc5c 0x000000000156bc5c 0x000000000156bc5c 0x0c356c 0x0c356c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x193cd10 0x000000000193dd10 0x000000000193dd10 0x0002f0 0x0002f0 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .align_load_begin .init .plt .plt.got .plt.sec .text .fini .align_load_end
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .note.gnu.property
   10     .eh_frame_hdr
   11
   12     .init_array .fini_array .dynamic .got

4. Logs from --- Before Collapse --- 

smaps:
55d436a00000-55d437a00000 r-xp 00400000 07:00 135                        /mnt/xfs-mnt/large_binary_thp_s_withoutfsync
Size:              16384 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                 256 kB
Pss:                 256 kB
Pss_Dirty:           256 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:       256 kB
Referenced:          256 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
ProtectionKey:         0
VmFlags: rd ex mr mw me sd

numa_maps:
55d436a00000 default file=/mnt/xfs-mnt/large_binary_thp_s_withoutfsync dirty=64 active=0 N1=64 kernelpagesize_kB=4

Additional logs inside the kernel:
[  129.257258] collapse_file: ENTER addr=55d436a00000 start=1024 end=1536 is_shmem=0
[  129.257266] collapse_file: allocated new_folio successfully
[  129.257267] collapse_file: XArray slots created, starting page scan
[  129.257268] collapse_file: scanning index=1024 folio=00000000be1a13db
[  129.257270] collapse_file: folio_test_dirty index=1024
[  129.257271]   folio=00000000be1a13db, flags=0x57ffffc8000078
[  129.257272]   mapping=000000004df7b047, inode=000000003395e5a1
[  129.257273]   folio_test_large=1
[  129.257273]   inode mode=0100755, i_writecount=-1 inode_is_open_for_write(inode)=0

[  129.257279]   VMA #2: 000055d436a00000-000055d437a00000 flags=0x8000075 PID=5268 comm=large_binary_th <-- CONTAINS DIRTY FOLIO
                Perms: r-xp  MAYWRITE MAYEXEC
[  129.257281]     File offset range: 0x400000 - 0x1400000
[  129.257282]     Page index range: 1024 - 5120

[  129.257289]   Total VMAs: 5, Writable VMAs: 0
[  129.257290]   Page details:
[  129.257290]     PG_dirty=1
[  129.257290]     PG_writeback=0
[  129.257291]     PG_uptodate=1
[  129.257291]     PG_locked=0
[  129.257292]     refcount=64
[  129.257292]     mapcount=32
[  129.260652] collapse_file: folio_test_dirty FAILED index=1024
[  129.260655] collapse_file: FAILED result=0, going to rollback
[  129.260656] collapse_file: ROLLBACK result=0
[  129.260661] collapse_file: EXIT result=0 
[  129.260661] collapse_file 0
[  129.260662] default 0
[  129.260663] madvise_collapse_errno: -22 last_fail: 0
[  129.260665] thps 0 ((hend - hstart) >> HPAGE_PMD_SHIFT) 8

Note: result=0 is SCAN_FAIL

Now, after the failure on first attempt, when I run the executable again:

-- success run --
Region is 0x56185f800000 to 0x561860800000 - length 16777216
56185f800000-561860800000 r-xp 00400000 07:00 135                        /mnt/xfs-mnt/large_binary_thp_s_withoutfsync
Size:              16384 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                 256 kB
Pss:                 256 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:       256 kB
Private_Dirty:         0 kB
Referenced:          256 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
ProtectionKey:         0
VmFlags: rd ex mr mw me sd

56185f800000 default file=/mnt/xfs-mnt/large_binary_thp_s_withoutfsync mapped=64 active=0 N1=64 kernelpagesize_kB=4

  Start: 0x56185f800000
  End:   0x561860800000
  Size:  16777216 bytes (16.00 MB)
  Hugepages: 8 x 2MB

Calling madvise(MADV_COLLAPSE)...
Successfully collapsed text section into hugepages!


5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
   range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
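
The workaround from point 3, combined with the call described in point 5, can
be sketched roughly as below. This is an illustration only, not the actual test
program: thp_aligned_range() and flush_then_collapse() are hypothetical helper
names, a 2MB PMD size is assumed (x86_64), and the fallback MADV_COLLAPSE value
of 25 is the one from the Linux UAPI headers, for libc headers that predate it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the Linux UAPI headers */
#endif

#define PMD_SIZE_ASSUMED (2UL << 20)	/* assumption: 2MB PMD (x86_64) */

/*
 * Shrink [start, end) to the largest 2MB-aligned sub-range.
 * Returns 0 and fills *astart / *alen, or -1 if no full 2MB unit fits.
 */
int thp_aligned_range(uintptr_t start, uintptr_t end,
		      uintptr_t *astart, size_t *alen)
{
	uintptr_t s = (start + PMD_SIZE_ASSUMED - 1) & ~(PMD_SIZE_ASSUMED - 1);
	uintptr_t e = end & ~(PMD_SIZE_ASSUMED - 1);

	if (e <= s)
		return -1;
	*astart = s;
	*alen = e - s;
	return 0;
}

/*
 * Write back the running binary's dirty page cache, then request the
 * collapse. Returns what madvise() returns (0, or -1 with errno set).
 */
int flush_then_collapse(uintptr_t start, uintptr_t end)
{
	uintptr_t astart;
	size_t alen;
	int fd;

	if (thp_aligned_range(start, end, &astart, &alen)) {
		errno = EINVAL;
		return -1;
	}

	fd = open("/proc/self/exe", O_RDONLY);
	if (fd >= 0) {
		fsync(fd);	/* best effort, as in the workaround */
		close(fd);
	}
	return madvise((void *)astart, alen, MADV_COLLAPSE);
}
```

The start and end arguments would be the r-xp mapping boundaries read from
/proc/self/maps; parsing that file is left out here.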

I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
EINVAL return for dirty pages. I'm happy to work on a patch.
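
For context on why the log above ends in -22: as far as I can tell from
mm/khugepaged.c, madvise_collapse() returns 0 only if every PMD-sized unit in
the range was collapsed, and otherwise converts the last scan failure into an
errno, with anything not explicitly listed falling through to -EINVAL. Below is
a small userspace model of that logic, not the kernel code itself: the model_
names and the reduced enum are made up for illustration, and the exact set of
results mapped to each errno should be checked against the tree in question.

```c
#include <errno.h>

/* Reduced, illustrative subset of the kernel's scan results. */
enum model_scan_result {
	MODEL_SCAN_FAIL,		/* 0, matches "result=0" in the log */
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_PAGE_LOCK,
	MODEL_SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/* Model of the scan result -> errno conversion. */
int model_collapse_errno(enum model_scan_result r)
{
	switch (r) {
	case MODEL_SCAN_ALLOC_HUGE_PAGE_FAIL:
		return -ENOMEM;
	case MODEL_SCAN_PAGE_LOCK:
		/* transient: trying again might succeed */
		return -EAGAIN;
	default:
		/* the dirty-folio SCAN_FAIL currently lands here */
		return -EINVAL;
	}
}

/* Success only if all PMD-sized units in the range were collapsed. */
int model_madvise_collapse_ret(unsigned long thps, unsigned long nr_pmds,
			       enum model_scan_result last_fail)
{
	return thps == nr_pmds ? 0 : model_collapse_errno(last_fail);
}
```

With the values from the log (thps=0, 8 units, last_fail=0) the model returns
-EINVAL. Returning EAGAIN for the dirty case would presumably need either a
distinct scan result or a special case in this conversion.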

Please let me know if any other information is needed for debugging.

Thanks,
Shivank



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-07  8:51         ` Garg, Shivank
@ 2025-11-07  9:12           ` David Hildenbrand (Red Hat)
  2025-11-07 10:09             ` Lance Yang
  2025-11-07 10:10             ` Lorenzo Stoakes
  0 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-07  9:12 UTC (permalink / raw)
  To: Garg, Shivank, Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts,
	Andrew Morton, Zi Yan, Baolin Wang, Nico Pache, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe,
	linux-mm, linux-kernel


> 
> 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
>     range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
> 
> I agree with the suggestions to either Return EAGAIN instead of EINVAL or At minimum, document the
> EINVAL return for dirty pages. I'm happy to work on a patch.

Of course, we could detect that we are in MADV_COLLAPSE and simply writeback ourselves. After all,
user space asked for a collapse, and it's not khugepaged that will simply revisit it later.

I did something similar in

commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
Author: David Hildenbrand <david@redhat.com>
Date:   Fri May 16 14:39:46 2025 +0200

     s390/uv: Improve splitting of large folios that cannot be split while dirty
     
     Currently, starting a PV VM on an iomap-based filesystem with large
     folio support, such as XFS, will not work. We'll be stuck in
     unpack_one()->gmap_make_secure(), because we can't seem to make progress
     splitting the large folio.

Where I effectively use filemap_write_and_wait_range().

It could be used early to writeback the whole range to collapse once, possibly.



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-07  9:12           ` David Hildenbrand (Red Hat)
@ 2025-11-07 10:09             ` Lance Yang
  2025-11-07 10:10             ` Lorenzo Stoakes
  1 sibling, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-11-07 10:09 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	Garg, Shivank, Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts,
	Andrew Morton, Zi Yan, Baolin Wang, Nico Pache, Dev Jain,
	Barry Song, Vlastimil Babka, Jann Horn, zokeefe, linux-mm,
	linux-kernel



On 2025/11/7 17:12, David Hildenbrand (Red Hat) wrote:
> 
>>
>> 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the 
>> executable, using the address
>>     range obtained from /proc/self/maps. IIUC, this should benefit 
>> applications by reducing ITLB pressure.
>>
>> I agree with the suggestions to either Return EAGAIN instead of EINVAL 
>> or At minimum, document the
>> EINVAL return for dirty pages. I'm happy to work on a patch.
> 
> Of course, we could detect that we are in MADV_COLLAPSE and simply 
> writeback ourselves. After all,
> user space asked for a collapse, and it's not khugepaged that will 
> simply revisit it later.
> 
> I did something similar in
> 
> commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
> Author: David Hildenbrand <david@redhat.com>
> Date:   Fri May 16 14:39:46 2025 +0200
> 
>      s390/uv: Improve splitting of large folios that cannot be split 
> while dirty
>      Currently, starting a PV VM on an iomap-based filesystem with large
>      folio support, such as XFS, will not work. We'll be stuck in
>      unpack_one()->gmap_make_secure(), because we can't seem to make 
> progress
>      splitting the large folio.
> 
> Where I effectively use filemap_write_and_wait_range().
> 
> It could be used early to writeback the whole range to collapse once, 
> possibly.

Exactly!

Since MADV_COLLAPSE is a best-effort thing, having the kernel use
something like filemap_write_and_wait_range() to writeback the pages
before collapsing is likely what users would expect.

Anyway, they just want to get a THP, whether the pages are dirty or
clean :)



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-07  9:12           ` David Hildenbrand (Red Hat)
  2025-11-07 10:09             ` Lance Yang
@ 2025-11-07 10:10             ` Lorenzo Stoakes
  2025-11-07 12:46               ` Garg, Shivank
  1 sibling, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 10:10 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Garg, Shivank, Liam R. Howlett, Ryan Roberts, Andrew Morton,
	Zi Yan, Baolin Wang, Nico Pache, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Jann Horn, zokeefe, linux-mm,
	linux-kernel

On Fri, Nov 07, 2025 at 10:12:02AM +0100, David Hildenbrand (Red Hat) wrote:
>
> >
> > 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
> >     range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
> >
> > I agree with the suggestions to either Return EAGAIN instead of EINVAL or At minimum, document the
> > EINVAL return for dirty pages. I'm happy to work on a patch.
>
> Of course, we could detect that we are in MADV_COLLAPSE and simply writeback ourselves. After all,
> user space asked for a collapse, and it's not khugepaged that will simply revisit it later.
>
> I did something similar in
>
> commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
> Author: David Hildenbrand <david@redhat.com>
> Date:   Fri May 16 14:39:46 2025 +0200
>
>     s390/uv: Improve splitting of large folios that cannot be split while dirty
>     Currently, starting a PV VM on an iomap-based filesystem with large
>     folio support, such as XFS, will not work. We'll be stuck in
>     unpack_one()->gmap_make_secure(), because we can't seem to make progress
>     splitting the large folio.
>
> Where I effectively use filemap_write_and_wait_range().
>
> It could be used early to writeback the whole range to collapse once, possibly.

I agree, let's just do a sync flush unconditionally and fix this that way.

This is simpler than I thought, the key bit of information is that we have
freshly written the executable so it sits in the page cache but dirty.

Thanks, Lorenzo



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-07 10:10             ` Lorenzo Stoakes
@ 2025-11-07 12:46               ` Garg, Shivank
  0 siblings, 0 replies; 16+ messages in thread
From: Garg, Shivank @ 2025-11-07 12:46 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand (Red Hat), Lance Yang
  Cc: Liam R. Howlett, Ryan Roberts, Andrew Morton, Zi Yan,
	Baolin Wang, Nico Pache, Dev Jain, Barry Song, Lance Yang,
	Vlastimil Babka, Jann Horn, zokeefe, linux-mm, linux-kernel



On 11/7/2025 3:40 PM, Lorenzo Stoakes wrote:
> On Fri, Nov 07, 2025 at 10:12:02AM +0100, David Hildenbrand (Red Hat) wrote:
>>
>>>
>>> 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
>>>     range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
>>>
>>> I agree with the suggestions to either Return EAGAIN instead of EINVAL or At minimum, document the
>>> EINVAL return for dirty pages. I'm happy to work on a patch.
>>
>> Of course, we could detect that we are in MADV_COLLAPSE and simply writeback ourselves. After all,
>> user space asked for a collapse, and it's not khugepaged that will simply revisit it later.
>>
>> I did something similar in
>>
>> commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Fri May 16 14:39:46 2025 +0200
>>
>>     s390/uv: Improve splitting of large folios that cannot be split while dirty
>>     Currently, starting a PV VM on an iomap-based filesystem with large
>>     folio support, such as XFS, will not work. We'll be stuck in
>>     unpack_one()->gmap_make_secure(), because we can't seem to make progress
>>     splitting the large folio.
>>
>> Where I effectively use filemap_write_and_wait_range().
>>
>> It could be used early to writeback the whole range to collapse once, possibly.
> 
> I agree, let's just do a sync flush unconditionally and fix this that way.
> 
> This is simpler than I thought, the key bit of information is that we have
> freshly written the executable so it sits in the page cache but dirty.
> 
> Thanks, Lorenzo


Thanks David for sharing the commit. This worked for me and the fix is simple.

+        if (!is_shmem && !cc->is_khugepaged && mapping_can_writeback(mapping)) {
+                loff_t range_start = start << PAGE_SHIFT;
+                loff_t range_end = (end << PAGE_SHIFT) - 1;
+                int ret;
+
+                ret = filemap_write_and_wait_range(mapping, range_start, range_end);
+                if (ret) {
+                        result = SCAN_FAIL;
+                        goto out;
+                }
+        }
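
As a sanity check on the range arithmetic in the hunk above, here is the same
index-to-byte conversion in userspace, assuming 4kB pages (a page shift of 12).
struct byte_range and index_to_byte_range() are illustrative names only. The
widening cast before the shift is an addition for the sketch: it makes no
difference on x86_64, but avoids truncation where the page index type is only
32 bits wide.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4kB base pages */

struct byte_range {
	int64_t first;	/* first byte offset in the file */
	int64_t last;	/* last byte offset, inclusive */
};

/* Convert the page-index range [start, end) to an inclusive byte range. */
struct byte_range index_to_byte_range(unsigned long start, unsigned long end)
{
	struct byte_range r;

	r.first = (int64_t)start << MODEL_PAGE_SHIFT;
	r.last = ((int64_t)end << MODEL_PAGE_SHIFT) - 1;
	return r;
}
```

For the unit in the earlier log (start=1024 end=1536) this gives 0x400000 to
0x5fffff, i.e. the first 2MB of the text segment at file offset 0x400000.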

I'll do some more testing and post a cleaned-up version with proper comments, rebased on mm-next.
Thanks,
Shivank



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 21:05       ` David Hildenbrand (Red Hat)
  2025-11-07  8:51         ` Garg, Shivank
@ 2025-11-07 10:09         ` Lorenzo Stoakes
  2025-11-07 12:50           ` Lorenzo Stoakes
  1 sibling, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 10:09 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Liam R. Howlett, Ryan Roberts, Garg, Shivank, Andrew Morton,
	Zi Yan, Baolin Wang, Nico Pache, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Jann Horn, zokeefe, linux-mm,
	linux-kernel

On Thu, Nov 06, 2025 at 10:05:41PM +0100, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
> > On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
> > > * Ryan Roberts <ryan.roberts@arm.com> [251106 11:33]:
> > > > On 06/11/2025 12:16, Garg, Shivank wrote:
> > > > > Hi All,
> > > > >
> > > > > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > > > > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > > > > current behavior and improvements.
> > > > >
> > > > > Problem:
> > > > > When attempting to collapse read-only file-backed TEXT sections into THPs
> > > > > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > > > > are marked dirty.
> > > > > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> > > > >
> > > > > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > > > > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> > > > >
> > > > > Root Cause:
> > > > > The failure occurs in mm/khugepaged.c:collapse_file():
> > > > > } else if (folio_test_dirty(folio)) {
> > > > >      /*
> > > > >       * khugepaged only works on read-only fd,
> > > > >       * so this page is dirty because it hasn't
> > > > >       * been flushed since first write. There
> > > > >       * won't be new dirty pages.
> > > > >       *
> > > > >       * Trigger async flush here and hope the
> > > > >       * writeback is done when khugepaged
> > > > >       * revisits this page.
> > > > >       */
> > > > >      xas_unlock_irq(&xas);
> > > > >      filemap_flush(mapping);
> > > > >      result = SCAN_FAIL;
> > > > >      goto xa_unlocked;
> > > > > }
> > > > >
> > > > > Why the text pages are dirty?
> > > >
> > > > This is the real question to answer, I think...
> > >
> > > Agree with Ryan here, let's stop things from being marked dirty if they
> > > are not.
> >
> > Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
> > mappings.
> >
> > collapse_single_pmd()
> > -> collapse_scan_file() if not vma_is_anonymous() (it won't be)
> > -> collapse_file()
> > -> the snippet above.

Sorry, I was looking at Nico's series; these function names aren't correct as of
mm-new atm.

This should be:

madvise_collapse()
-> hpage_collapse_scan_file()
-> collapse_file()


> >
> > But that could be running on an anon folio...
> >
> > Yup given it's CONFIG_READ_ONLY_THP_FOR_FS that is strange. We are confounding
> > expectations here surely?
> >
> > Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
> > but then collapse_file() goes into the snippet above and gets very confused.
> >
> > Do we need to add a folio_test_anon() here?
> >
> > Unless I'm missing something... (very possible, am only glancing over the code
> > here)
>
> collapse_file() operates exclusively on the pagecache.

Right you're correct:

	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);

	...

		folio = xas_load(&xas);

etc. etc.

And with the code you mention below, markers + MAP_PRIVATE are handled
correctly.

This THP code :) such fun.

So yeah this is as simple as the folio is literally just dirty.

And:

			} else if (folio_test_dirty(folio)) {
				/*
				 * khugepaged only works on read-only fd,
				 * so this page is dirty because it hasn't
				 * been flushed since first write. There
				 * won't be new dirty pages.
				 *
				 * Trigger async flush here and hope the
				 * writeback is done when khugepaged
				 * revisits this page.
				 *
				 * This is a one-off situation. We are not
				 * forcing writeback in loop.
				 */
				xas_unlock_irq(&xas);
				filemap_flush(mapping);
Since we do an async flush here ----^

This is why a retry (assuming writeback completed) works.

				result = SCAN_FAIL;
				goto xa_unlocked;
			} else if (folio_test_writeback(folio)) {
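
From userspace, the behaviour described above (the first call kicks off async
writeback, a later call succeeds once the folios are clean) suggests a bounded
retry. A minimal sketch, assuming the caller is willing to sleep briefly
between attempts: collapse_with_retry(), collapse_fn and fake_dirty_once() are
hypothetical names, and the fake only stands in for
madvise(addr, len, MADV_COLLAPSE) so the loop can be exercised without a
THP-capable setup.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Any function with madvise()-like return convention: 0, or -1 + errno. */
typedef int (*collapse_fn)(void *addr, size_t len);

/*
 * Retry while the failure looks transient. EINVAL is included only
 * because that is what the dirty-folio case currently returns.
 * Returns the number of attempts used on success, -1 on failure.
 */
int collapse_with_retry(collapse_fn fn, void *addr, size_t len,
			int max_tries, long delay_ms)
{
	struct timespec ts = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	int i;

	for (i = 1; i <= max_tries; i++) {
		if (fn(addr, len) == 0)
			return i;
		if (errno != EINVAL && errno != EAGAIN)
			break;
		if (i < max_tries)
			nanosleep(&ts, NULL);	/* give writeback time to finish */
	}
	return -1;
}

/* Stand-in for the real call: fails once with EINVAL, then succeeds. */
int fake_dirty_once(void *addr, size_t len)
{
	static int calls;

	(void)addr;
	(void)len;
	if (calls++ == 0) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```

A real caller would pass a small wrapper around madvise() instead of the fake.
Note that retrying on EINVAL also retries genuinely invalid arguments, which
is part of why the current error code is awkward.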

>
> I think we only start working on the actual page tables when calling
> retract_page_tables().

Yup.

>
> In there, we have this code, when iterating over page tables belonging
> to the mapping:
>
> 		/*
> 		 * The lock of new_folio is still held, we will be blocked in
> 		 * the page fault path, which prevents the pte entries from
> 		 * being set again. So even though the old empty PTE page may be
> 		 * concurrently freed and a new PTE page is filled into the pmd
> 		 * entry, it is still empty and can be removed.
> 		 *
> 		 * So here we only need to recheck if the state of pmd entry
> 		 * still meets our requirements, rather than checking pmd_same()
> 		 * like elsewhere.
> 		 */
> 		if (check_pmd_state(pmd) != SCAN_SUCCEED)
> 			goto drop_pml;
> 		ptl = pte_lockptr(mm, pmd);
> 		if (ptl != pml)
> 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> 		/*
> 		 * Huge page lock is still held, so normally the page table
> 		 * must remain empty; and we have already skipped anon_vma
> 		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
> 		 * held, it is still possible for a racing userfaultfd_ioctl()
> 		 * to have inserted ptes or markers.  Now that we hold ptlock,
> 		 * repeating the anon_vma check protects from one category,
> 		 * and repeating the userfaultfd_wp() check from another.
> 		 */
> 		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 			pmdp_get_lockless_sync();
> 			success = true;
> 		}
>
> Given !vma->anon_vma, we cannot have anon folios in there.
>
> Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.

Right.

>
> Given that all folios in the range we are collapsing were unmapped, we cannot have
> them mapped there.
>
> So the conclusion is that the page table must be empty and can be removed.
>
>
> Could guard markers be in there?

Right now guard markers only exist if vma->anon_vma is set, including the
file-backed case.

But for file-backed guard regions after my VMA sticky series this won't be the
case any more :)

So I had better go change that...

I hate that we have open-coded stuff all over the place that makes assumptions
like this.

This also ignores any other marker types. How I hate the uffd wp implementation.



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-07 10:09         ` Lorenzo Stoakes
@ 2025-11-07 12:50           ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 12:50 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Liam R. Howlett, Ryan Roberts, Garg, Shivank, Andrew Morton,
	Zi Yan, Baolin Wang, Nico Pache, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Jann Horn, zokeefe, linux-mm,
	linux-kernel

On Fri, Nov 07, 2025 at 10:09:41AM +0000, Lorenzo Stoakes wrote:
> On Thu, Nov 06, 2025 at 10:05:41PM +0100, David Hildenbrand (Red Hat) wrote:
> > 		/*
> > 		 * The lock of new_folio is still held, we will be blocked in
> > 		 * the page fault path, which prevents the pte entries from
> > 		 * being set again. So even though the old empty PTE page may be
> > 		 * concurrently freed and a new PTE page is filled into the pmd
> > 		 * entry, it is still empty and can be removed.
> > 		 *
> > 		 * So here we only need to recheck if the state of pmd entry
> > 		 * still meets our requirements, rather than checking pmd_same()
> > 		 * like elsewhere.
> > 		 */
> > 		if (check_pmd_state(pmd) != SCAN_SUCCEED)
> > 			goto drop_pml;
> > 		ptl = pte_lockptr(mm, pmd);
> > 		if (ptl != pml)
> > 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> >
> > 		/*
> > 		 * Huge page lock is still held, so normally the page table
> > 		 * must remain empty; and we have already skipped anon_vma
> > 		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
> > 		 * held, it is still possible for a racing userfaultfd_ioctl()
> > 		 * to have inserted ptes or markers.  Now that we hold ptlock,
> > 		 * repeating the anon_vma check protects from one category,
> > 		 * and repeating the userfaultfd_wp() check from another.
> > 		 */
> > 		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> > 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> > 			pmdp_get_lockless_sync();
> > 			success = true;
> > 		}
> >
> > Given !vma->anon_vma, we cannot have anon folios in there.
> >
> > Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.
>
> Right.
>
> >
> > Given that all folios in the range we are collapsing were unmapped, we cannot have
> > them mapped there.
> >
> > So the conclusion is that the page table must be empty and can be removed.
> >
> >
> > Could guard markers be in there?
>
> Right now guard markers only exist if vma->anon_vma is set, including the
> file-backed case.
>
> But for file-backed guard regions after my VMA sticky series this won't be the
> case any more :)
>
> So I had better go change that...
>
> I hate that we have open-coded stuff all over the place that makes assumptions
> like this.
>
> This also ignores any other marker types. How I hate the uffd wp implementation.

OK I audited all vma->anon_vma uses and _this_ is literally the only place that
is affected :)

Thanks for mentioning :P I have written a self test to reproduce this, and the
fix will land in v3 of the sticky VMA series.

Cheers, Lorenzo



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 12:16 madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages Garg, Shivank
  2025-11-06 12:55 ` Lance Yang
  2025-11-06 16:32 ` Ryan Roberts
@ 2025-11-06 20:32 ` Yang Shi
  2025-11-07  9:44   ` Garg, Shivank
  2 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2025-11-06 20:32 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
	Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe,
	linux-mm, linux-kernel

On Thu, Nov 6, 2025 at 7:16 AM Garg, Shivank <shivankg@amd.com> wrote:
>
> Hi All,
>
> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> current behavior and improvements.
>
> Problem:
> When attempting to collapse read-only file-backed TEXT sections into THPs
> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> are marked dirty.
> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = 22 (EINVAL)
>
> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>
> Root Cause:
> The failure occurs in mm/khugepaged.c:collapse_file():
> } else if (folio_test_dirty(folio)) {
>     /*
>      * khugepaged only works on read-only fd,
>      * so this page is dirty because it hasn't
>      * been flushed since first write. There
>      * won't be new dirty pages.
>      *
>      * Trigger async flush here and hope the
>      * writeback is done when khugepaged
>      * revisits this page.
>      */
>     xas_unlock_irq(&xas);
>     filemap_flush(mapping);
>     result = SCAN_FAIL;
>     goto xa_unlocked;
> }
>
> Why are the text pages dirty?

I'm not sure how you ran the test, but if you ran the program right
after it was built, background writeback may not have kicked in yet,
so MADV_COLLAPSE saw some dirty folios. At least, this is how your
reproducer works. This is why filemap_flush() was added in the first
place. Please see commit 75f360696ce9d8ec8b253452b23b3e24c0689b4b.

> It initially seemed unusual for a read-only text section to be marked as dirty, but
> this was actually confirmed by /proc/pid/smaps.
>
> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> Size:              16384 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:                 256 kB
> Pss:                 256 kB
> Pss_Dirty:           256 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:       256 kB
>
> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.

smaps reports Private_Dirty if either the PTE is dirty or the folio is
dirty. In this case, I don't expect the PTE to be dirty.
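
To check this from inside the test program before calling MADV_COLLAPSE,
something like the sketch below could pull a field out of smaps text. The
helper name smaps_field_kb() is made up for illustration, and it only returns
the first matching line, so the caller would need to pass in the lines for the
one mapping of interest (for example, a block read from /proc/self/smaps). It
cannot tell a dirty PTE from a dirty folio; it only reads what smaps reports.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the value (in kB) of the first "<field>:" line
 * found in smaps-formatted text, or -1 if the field is not present.
 * The field name must match exactly, so "Pss" does not match "Pss_Dirty".
 */
long smaps_field_kb(const char *text, const char *field)
{
	const char *line = text;
	size_t flen;

	assert(text != NULL && field != NULL);
	flen = strlen(field);

	while (line != NULL && *line != '\0') {
		long kb;

		/* Match "<field>:" at the start of the line, then parse the number. */
		if (strncmp(line, field, flen) == 0 && line[flen] == ':' &&
		    sscanf(line + flen + 1, "%ld kB", &kb) == 1)
			return kb;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return -1;
}
```

A non-zero Private_Dirty on an r-xp file mapping would then be a hint that the
first MADV_COLLAPSE call may hit the dirty-folio path.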

> This may be due to dynamic linker and relocations that occurred during program loading.
>
> Reproduction using XFS/EXT4:
>
> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
>    2MB-aligned and sized to a multiple of 2MB.
>   Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
>
> 2. Create and mount the XFS/EXT4 fs:
>    dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
>    losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
>    mkfs.xfs -f /dev/loop0
>    mkdir -p /mnt/xfs-mnt
>    mount /dev/loop0 /mnt/xfs-mnt
> 3. Copy the binaries to /mnt/xfs-mnt and execute.
> 4. madvise() fails with EINVAL on the first run, then succeeds on subsequent runs. (100% reproducible)
> 5. To reproduce again; reboot/kexec and repeat from step 2.
>
> Workaround:
> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
>         int fd = open("/proc/self/exe", O_RDONLY);
>         if (fd >= 0) {
>                 fsync(fd);
>                 close(fd);
>         }
>         // Now madvise(MADV_COLLAPSE) succeeds
> 2. Alternatively, retrying madvise_collapse on EINVAL failure also works.
>
> Problems with Current Behavior:
> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
>    rather than a transient condition that could succeed on retry.

Yeah, I agree the return value is confusing. -EAGAIN may be better, as
others have suggested.

>
> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
>    madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
>    the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically doesn't
>    retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
>    behavior for MADV_COLLAPSE?

Maybe MADV_COLLAPSE can have some retry logic?
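
Until something like that exists in the kernel, a userspace caller could do
the retry itself. The following is only a rough sketch of that idea, not a
proposal for the kernel side: collapse_with_retry(), collapse_fn and
fake_collapse() are names invented here, the retry count is arbitrary, and
retrying on EAGAIN merely anticipates the error-code change discussed above.
Between attempts it does the same synchronous fsync() of /proc/self/exe as the
workaround in the original report.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* from the Linux uapi headers (v6.1+) */
#endif

/* One collapse attempt: returns 0 on success, -1 with errno set on failure. */
typedef int (*collapse_fn)(void *addr, size_t len);

/* Real attempt: a thin wrapper around madvise(MADV_COLLAPSE). */
int madvise_collapse(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLLAPSE);
}

/* Best-effort synchronous flush of our own executable; errors are ignored. */
void flush_self_exe(void)
{
	int fd = open("/proc/self/exe", O_RDONLY);

	if (fd >= 0) {
		fsync(fd);
		close(fd);
	}
}

/*
 * Hypothetical helper: call attempt() up to max_tries times, flushing the
 * executable before each retry. Only EINVAL and EAGAIN are treated as
 * possibly transient; any other errno is returned to the caller at once.
 */
int collapse_with_retry(collapse_fn attempt, void *addr, size_t len,
			int max_tries)
{
	assert(attempt != NULL);

	if (max_tries <= 0) {
		errno = EINVAL;
		return -1;
	}

	for (int i = 0; i < max_tries; i++) {
		if (i > 0)
			flush_self_exe();
		if (attempt(addr, len) == 0)
			return 0;
		if (errno != EINVAL && errno != EAGAIN)
			return -1;
	}
	return -1;	/* errno is left as set by the last attempt */
}

/* Stand-in attempt for illustration: fails with EINVAL twice, then succeeds. */
int fake_calls;
int fake_collapse(void *addr, size_t len)
{
	(void)addr;
	(void)len;
	if (++fake_calls <= 2) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```

A real caller would pass madvise_collapse together with the 2MB-aligned start
and size; fake_collapse is only there so the retry loop can be exercised
without THP support.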

Thanks,
Yang

>
> Would appreciate thoughts on the best approach to address this issue.
>
> Thanks,
> Shivank
>



* Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
  2025-11-06 20:32 ` Yang Shi
@ 2025-11-07  9:44   ` Garg, Shivank
  0 siblings, 0 replies; 16+ messages in thread
From: Garg, Shivank @ 2025-11-07  9:44 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
	Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Jann Horn, zokeefe,
	linux-mm, linux-kernel



On 11/7/2025 2:02 AM, Yang Shi wrote:
> On Thu, Nov 6, 2025 at 7:16 AM Garg, Shivank <shivankg@amd.com> wrote:
>>
>> Hi All,
>>
>> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
>> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
>> current behavior and improvements.
>>
>> Problem:
>> When attempting to collapse read-only file-backed TEXT sections into THPs
>> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
>> are marked dirty.
>> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = 22 (EINVAL)
>>
>> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
>> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>>
>> Root Cause:
>> The failure occurs in mm/khugepaged.c:collapse_file():
>> } else if (folio_test_dirty(folio)) {
>>     /*
>>      * khugepaged only works on read-only fd,
>>      * so this page is dirty because it hasn't
>>      * been flushed since first write. There
>>      * won't be new dirty pages.
>>      *
>>      * Trigger async flush here and hope the
>>      * writeback is done when khugepaged
>>      * revisits this page.
>>      */
>>     xas_unlock_irq(&xas);
>>     filemap_flush(mapping);
>>     result = SCAN_FAIL;
>>     goto xa_unlocked;
>> }
>>
>> Why the text pages are dirty?
> 
> I'm not sure how you did the test, but if you ran the program right
> after it was built, it may be possible the background writeback has
> not kicked in yet, then MADV_COLLAPSE saw some dirty folios. This is
> how your reproducer works at least. This is why filemap_flush() was
> added in the first place. Please see commit
> 75f360696ce9d8ec8b253452b23b3e24c0689b4b.

The program can be either freshly compiled or previously compiled.
The error occurs specifically on a fresh mount, after copying the binary there.
So the key factor is the fresh mount and copy operation, not the build.


> 
>> It initially seemed unusual for a read-only text section to be marked as dirty, but
>> this was actually confirmed by /proc/pid/smaps.
>>
>> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
>> Size:              16384 kB
>> KernelPageSize:        4 kB
>> MMUPageSize:           4 kB
>> Rss:                 256 kB
>> Pss:                 256 kB
>> Pss_Dirty:           256 kB
>> Shared_Clean:          0 kB
>> Shared_Dirty:          0 kB
>> Private_Clean:         0 kB
>> Private_Dirty:       256 kB
>>
>> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> 
> smaps shows private dirty if either the PTE is dirty or the folio is
> dirty. For this case, I don't expect the PTE is dirty.
> 
>> This may be due to dynamic linker and relocations that occurred during program loading.
>>
>> Reproduction using XFS/EXT4:
>>
>> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
>>    2MB-aligned and sized to a multiple of 2MB.
>>   Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
>> LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
>>
>> 2. Create and mount the XFS/EXT4 fs:
>>    dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
>>    losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
>>    mkfs.xfs -f /dev/loop0
>>    mkdir -p /mnt/xfs-mnt
>>    mount /dev/loop0 /mnt/xfs-mnt
>> 3. Copy the binaries to /mnt/xfs-mnt and execute.
>> 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
>> 5. To reproduce again; reboot/kexec and repeat from step 2.
>>
>> Workaround:
>> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
>>         int fd = open("/proc/self/exe", O_RDONLY);
>>         if (fd >= 0) {
>>                 fsync(fd);
>>                 close(fd);
>>         }
>>         // Now madvise(MADV_COLLAPSE) succeeds
>> 2. Alternatively, retrying madvise_collapse on EINVAL failure also works.
>>
>> Problems with Current Behavior:
>> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
>>    rather than a transient condition that could succeed on retry.
> 
> Yeah, I agree the return value is confusing. -EAGAIN may be better as
> suggested by others.
> 
>>
>> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
>>    madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
>>    the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically doesn't
>>    retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
>>    behavior for MADV_COLLAPSE?
> 
> Maybe MADV_COLLAPSE can have some retry logic?
> 
> Thanks,
> Yang
> 
>>
>> Would appreciate thoughts on the best approach to address this issue.
>>
>> Thanks,
>> Shivank
>>




end of thread, other threads:[~2025-11-07 12:51 UTC | newest]

Thread overview: 16+ messages
2025-11-06 12:16 madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages Garg, Shivank
2025-11-06 12:55 ` Lance Yang
2025-11-06 13:03   ` Nico Pache
2025-11-06 16:32 ` Ryan Roberts
2025-11-06 16:55   ` Liam R. Howlett
2025-11-06 17:17     ` Lorenzo Stoakes
2025-11-06 21:05       ` David Hildenbrand (Red Hat)
2025-11-07  8:51         ` Garg, Shivank
2025-11-07  9:12           ` David Hildenbrand (Red Hat)
2025-11-07 10:09             ` Lance Yang
2025-11-07 10:10             ` Lorenzo Stoakes
2025-11-07 12:46               ` Garg, Shivank
2025-11-07 10:09         ` Lorenzo Stoakes
2025-11-07 12:50           ` Lorenzo Stoakes
2025-11-06 20:32 ` Yang Shi
2025-11-07  9:44   ` Garg, Shivank
