From: Pedro Falcato <pfalcato@suse.de>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hughd@google.com,
david@kernel.org, ljs@kernel.org, Liam.Howlett@oracle.com,
vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, baolin.wang@linux.alibaba.com,
linux-kernel@vger.kernel.org, kernel-team@meta.com,
stable@vger.kernel.org
Subject: Re: [PATCH] mm/shmem: use invalidate_lock to fix hole-punch race
Date: Thu, 26 Mar 2026 19:16:05 +0000 [thread overview]
Message-ID: <bnukmnuxxuhdfeasjz33miemgr7w35c4aa6pqdmgupx7oxmeeb@gozgc3yxhcdd> (raw)
In-Reply-To: <acV83cdc9ZfNk8Xh@gourry-fedora-PF4VCD3F>
On Thu, Mar 26, 2026 at 01:37:17PM -0500, Gregory Price wrote:
> On Thu, Mar 26, 2026 at 05:07:42PM +0000, Pedro Falcato wrote:
> > > Two races allow PTEs to be re-installed for a folio that fallocate
> > > is about to remove from page cache:
> >
> > Hmm, I don't see how your patch fixes anything.
> >
>
> after looking at your comments below i realized race 2 actually requires
> the fork as well, which means they're both essentially variations of the
> same race, so hopefully i can simplify the change log.
Well, then I don't see how changing shmem_fault() & map_mages() fixes fork.
>
> > > fallocate fault-around fork
> > > -------- ------------ ----
> > > set i_private
> > > unmap_mapping_range()
> > > # zaps PTEs
> > > filemap_map_pages()
> > > # re-maps folio!
> > > dup_mmap()
> > > # child VMA
> > > # in tree
> > > shmem_undo_range()
> > > lock folio
> > > unmap_mapping_folio()
> ^^^ i_mmap_lock_read held, iterates VMAs
> > spin_lock(ptl);
> ^^^ child VMA's PTL
> > > # child VMA:
> > > # no PTE, skip
> > spin_unlock(ptl);
> ^^^ child VMA done, iterator moves on
> it will not re-visit the child.
>
> > > copy_page_range()
> > spin_lock(dst_ptl);
> ^ Child PTL
> > spin_lock(src_ptl);
> ^ Parent PTL
> > /* does not copy PTE. either
> > * we find a zapped PTE, or unmap_mapping_folio()
> > * finds two mappings instead of one. */
>
> At this point, unmap_mapping_folio only processed the child VMA
> (no PTE, skip). The parent PTE *has not* been zapped.
>
> copy_page_range() acquires src_ptl (parent) and reads a present PTE,
> and boom copies it to child.
Sure, but can child - parent happen when traversing the i_mmap tree? I don't
think so? (in mm/mmap.c)
/* insert tmp into the share list, just after mpnt */
vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
The function itself is somewhat straightforward - find the leftmost node at the
right of 'prev' (our parent) and link ourselves. So an in-order traversal should
always go parent - child. Unless there's some awful tree rotation that can
happen and screw us in the meanwhile.
>
> When it reaches the parent VMA next, it zaps the parent PTE,
> but the child PTE (just installed) survives.
>
> > >
> > > Fix both races with invalidate_lock.
> > >
> >
> > I don't see what you're seeing? Note that both map_pages and fault()
> > take the folio lock (map_pages does a trylock) to exclude against truncate
> > as well.
> >
>
> The folio lock serializes map_pages/fault against truncate - but the
> race isn't between those two. It's between truncate's unmap walk and
> fork's copy_page_range - and copy_page_range doesn't take folio lock.
If we observe everything parent - child, there is no way this is broken - if
fork observes the parent pte set, zap will have to observe parent *and* child,
since they hold the corresponding pte locks, and traversal is done in order.
If fork observes the parent pte as none, zap will have already traversed the
parent, and as such there will be no additional mapping of the folio.
If this is broken, then every filesystem out there using filemap_fault() and
filemap_fault_around() has to be broken, and I hope that's not true :p
_If_ there is indeed breakage here regarding tree rotations, I would suggest:
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..7b4e39063d67 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1833,12 +1833,12 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- i_mmap_unlock_write(mapping);
}
if (!(tmp->vm_flags & VM_WIPEONFORK))
retval = copy_page_range(tmp, mpnt);
-
+ if (file)
+ i_mmap_unlock_write(mapping);
if (retval) {
mpnt = vma_next(&vmi);
goto loop_out;
which should protect against concurrent rmap.
--
Pedro
next prev parent reply other threads:[~2026-03-26 19:16 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-26 16:26 Gregory Price
2026-03-26 17:07 ` Pedro Falcato
2026-03-26 18:37 ` Gregory Price
2026-03-26 19:16 ` Pedro Falcato [this message]
2026-03-26 19:48 ` Gregory Price
2026-03-27 4:35 ` Gregory Price
2026-03-26 19:21 ` Matthew Wilcox
2026-03-26 20:09 ` Gregory Price
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bnukmnuxxuhdfeasjz33miemgr7w35c4aa6pqdmgupx7oxmeeb@gozgc3yxhcdd \
--to=pfalcato@suse.de \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=david@kernel.org \
--cc=gourry@gourry.net \
--cc=hughd@google.com \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=rppt@kernel.org \
--cc=stable@vger.kernel.org \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox