Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	"Garg, Shivank" <shivankg@amd.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	zokeefe@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
Date: Fri, 7 Nov 2025 10:09:41 +0000	[thread overview]
Message-ID: <868d1103-df00-43dd-b04b-21568309445e@lucifer.local> (raw)
In-Reply-To: <77a54f63-f5da-42a2-b24d-5c8a0f41d1e6@gmail.com>

On Thu, Nov 06, 2025 at 10:05:41PM +0100, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
> > On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
> > > * Ryan Roberts <ryan.roberts@arm.com> [251106 11:33]:
> > > > On 06/11/2025 12:16, Garg, Shivank wrote:
> > > > > Hi All,
> > > > >
> > > > > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > > > > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > > > > current behavior and improvements.
> > > > >
> > > > > Problem:
> > > > > When attempting to collapse read-only file-backed TEXT sections into THPs
> > > > > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > > > > are marked dirty.
> > > > > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> > > > >
> > > > > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > > > > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> > > > >
> > > > > Root Cause:
> > > > > The failure occurs in mm/khugepaged.c:collapse_file():
> > > > > } else if (folio_test_dirty(folio)) {
> > > > >      /*
> > > > >       * khugepaged only works on read-only fd,
> > > > >       * so this page is dirty because it hasn't
> > > > >       * been flushed since first write. There
> > > > >       * won't be new dirty pages.
> > > > >       *
> > > > >       * Trigger async flush here and hope the
> > > > >       * writeback is done when khugepaged
> > > > >       * revisits this page.
> > > > >       */
> > > > >      xas_unlock_irq(&xas);
> > > > >      filemap_flush(mapping);
> > > > >      result = SCAN_FAIL;
> > > > >      goto xa_unlocked;
> > > > > }
> > > > >
> > > > > Why the text pages are dirty?
> > > >
> > > > This is the real question to to answer, I think...
> > >
> > > Agree with Ryan here, let's stop things from being marked dirty if they
> > > are not.
> >
> > Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
> > mappings.
> >
> > collapse_single_pmd()
> > -> collapse_scan_file() if not vma_is_anonymous() (it won't be)
> > -> collapse_file()
> > -> the snippet above.

Sorry I was looking at Nico's series these functions aren't correct as of
mm-new atm.

This should be:

madvise_collapse()
-> hpage_collapse_scan_file()
-> collapse_file()


> >
> > But that could be running on an anon folio...
> >
> > Yup given it's CONFIG_READY_ONLY_THP_FOR_FS that is strange. We are confounding
> > expectations here surely?
> >
> > Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
> > but then collapse_file() goes into the snippet above and gets very confused.
> >
> > Do we need to add a folio_test_anon() here?
> >
> > Unless I'm missing something... (very possible, am only glancing over the code
> > here)
>
> collapse_file() operates exclusively on the pagecache.

Right you're correct:

	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);

	...

		folio = xas_load(&xas);

etc. etc.

And with the code you mention below, markers + MAP_PRIVATE are handled
correctly.

This THP code :) such fun.

So yeah this is as simple as the folio is literally just dirty.

And:

			} else if (folio_test_dirty(folio)) {
				/*
				 * khugepaged only works on read-only fd,
				 * so this page is dirty because it hasn't
				 * been flushed since first write. There
				 * won't be new dirty pages.
				 *
				 * Trigger async flush here and hope the
				 * writeback is done when khugepaged
				 * revisits this page.
				 *
				 * This is a one-off situation. We are not
				 * forcing writeback in loop.
				 */
				xas_unlock_irq(&xas);
				filemap_flush(mapping);
Since we do an async flush here ----^

This is why a retry (assuming writeback completed) works.

				result = SCAN_FAIL;
				goto xa_unlocked;
			} else if (folio_test_writeback(folio)) {

>
> I think we only start working on the actual page tables when calling
> retract_page_tables().

Yup.

>
> In there, we have this code, when iterating over page tables belonging
> to the mapping:
>
> 		/*
> 		 * The lock of new_folio is still held, we will be blocked in
> 		 * the page fault path, which prevents the pte entries from
> 		 * being set again. So even though the old empty PTE page may be
> 		 * concurrently freed and a new PTE page is filled into the pmd
> 		 * entry, it is still empty and can be removed.
> 		 *
> 		 * So here we only need to recheck if the state of pmd entry
> 		 * still meets our requirements, rather than checking pmd_same()
> 		 * like elsewhere.
> 		 */
> 		if (check_pmd_state(pmd) != SCAN_SUCCEED)
> 			goto drop_pml;
> 		ptl = pte_lockptr(mm, pmd);
> 		if (ptl != pml)
> 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> 		/*
> 		 * Huge page lock is still held, so normally the page table
> 		 * must remain empty; and we have already skipped anon_vma
> 		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
> 		 * held, it is still possible for a racing userfaultfd_ioctl()
> 		 * to have inserted ptes or markers.  Now that we hold ptlock,
> 		 * repeating the anon_vma check protects from one category,
> 		 * and repeating the userfaultfd_wp() check from another.
> 		 */
> 		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 			pmdp_get_lockless_sync();
> 			success = true;
> 		}
>
> Given !vma->anon_vma, we cannot have anon folios in there.
>
> Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.

Right.

>
> Given that all folios in the range we are collapsing where unmapped, we cannot have
> them mapped there.
>
> So the conclusion is that the page table must be empty and can be removed.
>
>
> Could guard markers be in there?

Right now guard markers only exist if vma->anon_vma is set, including the
file-backed case.

But for file-backed guard regions after my VMA sticky series this won't be the
case any more :)

So I had better go change that...

I hate that we have open-coded stuff all over the place that makes assumptions
like this.

This also ignores any other marker types. How I hate the uffd wp implementation.

next prev parent reply	other threads:[~2025-11-07 10:10 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-06 12:16 Garg, Shivank
2025-11-06 12:55 ` Lance Yang
2025-11-06 13:03   ` Nico Pache
2025-11-06 16:32 ` Ryan Roberts
2025-11-06 16:55   ` Liam R. Howlett
2025-11-06 17:17     ` Lorenzo Stoakes
2025-11-06 21:05       ` David Hildenbrand (Red Hat)
2025-11-07  8:51         ` Garg, Shivank
2025-11-07  9:12           ` David Hildenbrand (Red Hat)
2025-11-07 10:09             ` Lance Yang
2025-11-07 10:10             ` Lorenzo Stoakes
2025-11-07 12:46               ` Garg, Shivank
2025-11-07 10:09         ` Lorenzo Stoakes [this message]
2025-11-07 12:50           ` Lorenzo Stoakes
2025-11-06 20:32 ` Yang Shi
2025-11-07  9:44   ` Garg, Shivank

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=868d1103-df00-43dd-b04b-21568309445e@lucifer.local \
    --to=lorenzo.stoakes@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=davidhildenbrandkernel@gmail.com \
    --cc=dev.jain@arm.com \
    --cc=jannh@google.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npache@redhat.com \
    --cc=ryan.roberts@arm.com \
    --cc=shivankg@amd.com \
    --cc=vbabka@suse.cz \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox