Re: [PATCH] mm: swap: check for xa_zero_entry() on vma in swapoff path

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>,
	Charan Teja Kalla <quic_charante@quicinc.com>,
	akpm@linux-foundation.org, shikemeng@huaweicloud.com,
	kasong@tencent.com, nphamcs@gmail.com, bhe@redhat.com,
	baohua@kernel.org, chrisl@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH] mm: swap: check for xa_zero_entry() on vma in swapoff path
Date: Mon, 11 Aug 2025 11:17:51 -0400	[thread overview]
Message-ID: <efe3aogdw5wxsn46xyy2rrqui7oghyi7elam7aiv3c6o6hsfbx@ee6dayztcy2x> (raw)
In-Reply-To: <fd35dd5d-95cc-4c37-bf72-52a27fe822ac@lucifer.local>

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250811 09:22]:
> On Mon, Aug 11, 2025 at 03:19:53PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > > When registering vmas for uprobe, skip the vmas in an mm that is marked
> > > > > > unstable.  Modifying a vma in an unstable mm may cause issues if the mm
> > > > > > isn't fully initialised.__
> > > > > >
> > > > > > > Is there anything preventing us from just leaving a proper tree that
> > > > > > > reflects reality in place before we drop the write lock?
> > > > > >
> > > > > > When you mean proper tree, is this about the your previous question? --
> > > > > > Shouldn't we just remove anything from the tree here that was not copied
> > > > > > immediately?
> > > > >
> > > > > Commit d24062914837 (" fork: use __mt_dup() to duplicate maple tree in
> > > > > dup_mmap()") did this for efficiency, so it'd be a regression to do this.
> > > >
> > > > We're talking about the case where fork *fails*. That cannot possibly be
> > > > relevant for performance, can it? :)
> > >
> > > I think it optimises the overall operation, but as a product of that, has to
> > > handle this edge case, and that necessitated this rather horrble stuff.
> > >
> > > Obviously we don't need to optimise a 'we are about to die' case :)
> >
> > Right, so my original question was whether we could just drop all stale
> > stuff from the tree before we lift the mmap write lock, leaving only the
> > VMAs in there that we actually processed successfully.
> 
> That'd be better answered by Liam who's more familiar with it.

The short answer is no, we cannot.

But some options and questions below..

> 
> I think it may actually be difficult to do on some level or there was some
> reason we couldn't, but I may be mistaken.

Down the rabbit hole we go..

The cloning of the tree happens by copying the tree in DFS and
replacing the old nodes with new nodes.  The tree leaves end up being
copied, which contains all the vmas (unless DONT_COPY is set, so
basically always all of them..).  When the tree is copied, we have a
duplicate of the tree with pointers to all the vmas in the old process.

The way the tree fails is that we've been unable to finish cloning it,
usually for out of memory reasons.  So, this means we have a tree with
new and exciting vmas that have never been used and old but still active
vmas in oldmm.

The failure point is then marked with an XA_ZERO_ENTRY, which will
succeed in storing as it's a direct replacement in the tree so no
allocations necessary.  Thus this is safe even in -ENOMEM scenarios.

Clearing out the stale data means we may actually need to allocate to
remove vmas from the new tree, because we use allocated memory in the
maple tree - we'll need to rebalance, new parents, etc, etc.

So, to remove the stale data - we may actually have to allocate memory.

But we're most likely out of memory.. and we don't want to get the
shrinker involved in a broken task teardown, especially since it has
already been run and failed to help..

We could replace all the old vmas with XA_ZERO_ENTRY, but that doesn't
really fix this issue either.

I could make a function that frees all new vmas and destroys the tree
specifically for this failure state?

I'm almost certain this will lead to another whack-a-mole situation, but
those _should_ already be checked or written to work when there are zero
vmas in an mm (going by experience of what the scheduler does with an
empty tree).  Syzbot finds these scenarios sometimes via signals or
other corner cases that can happen..

Then again, I also thought the unstable mm should be checked where
necessary to avoid assumptions on the mm state..?

This is funny because we already have a (probably) benign race with oom
here.  This code may already visit the mm after __oom_reap_task_mm() and
the mm disappearing, but since the anon vmas should be removed,
unuse_mm() will skip them.

Although, I'm not sure what happens when
mmu_notifier_invalidate_range_start_nonblock() fails AND unuse_mm() is
called on the mm after.  Maybe checking the unstable mm is necessary
here anyways?

Thanks,
Liam

next prev parent reply	other threads:[~2025-08-11 15:18 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-08  9:21 Charan Teja Kalla
2025-08-08 12:01 ` David Hildenbrand
2025-08-08 12:04   ` David Hildenbrand
2025-08-11  9:43     ` Charan Teja Kalla
2025-08-11 12:14       ` Lorenzo Stoakes
2025-08-11 13:03         ` David Hildenbrand
2025-08-11 13:08           ` Lorenzo Stoakes
2025-08-11 13:19             ` David Hildenbrand
2025-08-11 13:22               ` Lorenzo Stoakes
2025-08-11 15:17                 ` Liam R. Howlett [this message]
2025-08-11 15:39                   ` David Hildenbrand
2025-08-11 15:48                     ` Lorenzo Stoakes
2025-08-11 15:51                       ` David Hildenbrand
2025-08-11 15:48                     ` Liam R. Howlett
2025-08-11 12:07 ` Lorenzo Stoakes
2025-08-11 16:29   ` Charan Teja Kalla

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=efe3aogdw5wxsn46xyy2rrqui7oghyi7elam7aiv3c6o6hsfbx@ee6dayztcy2x \
    --to=liam.howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=bhe@redhat.com \
    --cc=chrisl@kernel.org \
    --cc=david@redhat.com \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=nphamcs@gmail.com \
    --cc=quic_charante@quicinc.com \
    --cc=shikemeng@huaweicloud.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox