linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yang Shi <shy828301@gmail.com>
To: "Garg, Shivank" <shivankg@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Zi Yan <ziy@nvidia.com>,
	 Baolin Wang <baolin.wang@linux.alibaba.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	 Barry Song <baohua@kernel.org>,
	Lance Yang <lance.yang@linux.dev>,
	 Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	zokeefe@google.com,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
Date: Thu, 6 Nov 2025 12:32:58 -0800	[thread overview]
Message-ID: <CAHbLzkqvXsFfziYU6A_LXfF2UQHkmNHqyT05P+dTav3mi4b0hA@mail.gmail.com> (raw)
In-Reply-To: <4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com>

On Thu, Nov 6, 2025 at 7:16 AM Garg, Shivank <shivankg@amd.com> wrote:
>
> Hi All,
>
> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> current behavior and improvements.
>
> Problem:
> When attempting to collapse read-only file-backed TEXT sections into THPs
> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> are marked dirty.
> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
>
> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>
> Root Cause:
> The failure occurs in mm/khugepaged.c:collapse_file():
> } else if (folio_test_dirty(folio)) {
>     /*
>      * khugepaged only works on read-only fd,
>      * so this page is dirty because it hasn't
>      * been flushed since first write. There
>      * won't be new dirty pages.
>      *
>      * Trigger async flush here and hope the
>      * writeback is done when khugepaged
>      * revisits this page.
>      */
>     xas_unlock_irq(&xas);
>     filemap_flush(mapping);
>     result = SCAN_FAIL;
>     goto xa_unlocked;
> }
>
> Why the text pages are dirty?

I'm not sure how you did the test, but if you ran the program right
after it was built, it may be possible the background writeback has
not kicked in yet, then MAD_COLLAPSE saw some dirty folios. This is
how your reproducer works at least. This is why filemap_flush() was
added in the first place. Please see commit
75f360696ce9d8ec8b253452b23b3e24c0689b4b.

> It initially seemed unusual for a read-only text section to be marked as dirty, but
> this was actually confirmed by /proc/pid/smaps.
>
> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133                        /mnt/xfs-mnt/large_binary_thp
> Size:              16384 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:                 256 kB
> Pss:                 256 kB
> Pss_Dirty:           256 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:       256 kB
>
> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.

smaps shows private dirty if either the PTE is dirty or the folio is
dirty. For this case, I don't expect the PTE is dirty.

> This may be due to dynamic linker and relocations that occurred during program loading.
>
> Reproduction using XFS/EXT4:
>
> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
>    2MB-aligned and sized to a multiple of 2MB.
>   Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
> LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
>
> 2. Create and mount the XFS/EXT4 fs:
>    dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
>    losetup -f --show /tmp/xfs-test.img  # output: /dev/loop0
>    mkfs.xfs -f /dev/loop0
>    mkdir -p /mnt/xfs-mnt
>    mount /dev/loop0 /mnt/xfs-mnt
> 3. Copy the binaries to /mnt/xfs-mnt and execute.
> 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> 5. To reproduce again; reboot/kexec and repeat from step 2.
>
> Workaround:
> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
>         int fd = open("/proc/self/exe", O_RDONLY);
>         if (fd >= 0) {
>                 fsync(fd);
>                 close(fd);
>         }
>         // Now madvise(MADV_COLLAPSE) succeeds
> 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
>
> Problems with Current Behavior:
> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
>    rather than a transient condition that could succeed on retry.

Yeah, I agree the return value is confusing. -EAGAIN may be better as
suggested by others.

>
> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
>    madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
>    the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
>    retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
>    behavior for MADV_COLLAPSE?

Maybe MADV_COLLAPSE can have some retry logic?

Thanks,
Yang

>
> Would appreciate thoughts on the best approach to address this issue.
>
> Thanks,
> Shivank
>


  parent reply	other threads:[~2025-11-06 20:33 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-06 12:16 Garg, Shivank
2025-11-06 12:55 ` Lance Yang
2025-11-06 13:03   ` Nico Pache
2025-11-06 16:32 ` Ryan Roberts
2025-11-06 16:55   ` Liam R. Howlett
2025-11-06 17:17     ` Lorenzo Stoakes
2025-11-06 21:05       ` David Hildenbrand (Red Hat)
2025-11-07  8:51         ` Garg, Shivank
2025-11-07  9:12           ` David Hildenbrand (Red Hat)
2025-11-07 10:09             ` Lance Yang
2025-11-07 10:10             ` Lorenzo Stoakes
2025-11-07 12:46               ` Garg, Shivank
2025-11-07 10:09         ` Lorenzo Stoakes
2025-11-07 12:50           ` Lorenzo Stoakes
2025-11-06 20:32 ` Yang Shi [this message]
2025-11-07  9:44   ` Garg, Shivank

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHbLzkqvXsFfziYU6A_LXfF2UQHkmNHqyT05P+dTav3mi4b0hA@mail.gmail.com \
    --to=shy828301@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=jannh@google.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=npache@redhat.com \
    --cc=ryan.roberts@arm.com \
    --cc=shivankg@amd.com \
    --cc=vbabka@suse.cz \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox