From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	syzbot+5b19bad23ac7f44bf8b8@syzkaller.appspotmail.com,
	Suren Baghdasaryan <surenb@google.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v2] mm: fix vma_start_write_killable() signal handling
Date: Wed, 26 Nov 2025 20:33:47 +0000
Message-ID: <d8c2a266-fb2c-4374-ab14-4ac910db7ad2@lucifer.local>
In-Reply-To: <aSdYkYYAuATwYqbD@casper.infradead.org>

On Wed, Nov 26, 2025 at 07:44:17PM +0000, Matthew Wilcox wrote:
> On Wed, Nov 26, 2025 at 06:55:52PM +0000, Lorenzo Stoakes wrote:
> > > It's only "impossible" currently due to some fairly esoteric reasoning.
> > > As far as _this_ function is concerned, it's entirely possible.
> > > I don't want to leave this trap for the next person who calls
> > > __vma_enter_locked(TASK_KILLABLE).
> >
> > Calls __vma_enter_locked(TASK_KILLABLE) _when detaching_, otherwise
> > refcount will always be >0.
> >
> > So we're only looking at us changing vma_mark_detached() to use
> > TASK_KILLABLE.
> >
> > As this is such a subtle corner case I still think it warrants a
> > warning. Or at least a VM_WARN_ON_ONCE(1).
> >
> > A killable detacher is, as Vlasta points out, kind of an unwise thing to do
> > anyway right?
>
> I missed where that was said?

"Yeah I guess it's for the best to keep vma_mark_detached() use the
TASK_UNINTERRUPTIBLE variant, maybe document why. Aborting the detaching
would be counter productive."

https://lore.kernel.org/all/058f5858-f508-40f8-adfe-e5de78621d64@suse.cz/

>
> > >
> > > > > +			/*
> > > > > +			 * We got a fatal signal, but the last reader went
> > > > > +			 * away as well.  Resolve the race in favour of
> > > >
> > > > This is very subtle, I don't think this really explains this clearly
> > > > enough.
> > > >
> > > > Maybe put something like:
> > > >
> > > > 	/* Couldn't wait on readers probably due to a fatal signal, so unlock. */
> > > >
> > > > Before the refcount_sub_and_test()
> > >
> > > I think this falls into the "saying what you're doing, not why
> > > you're doing it" trap.  Whereas my comment is at a higher level --
> > > there's a race where both exit conditions are true at the same time.
> > > The rcuwait_wait_event() picked one option, but we would rather resolve
> > > the race in the opposite direction.
> >
> > I find your comment unclear, and I think it's too succinct. I was trying to
> > provide the most succinct-yet-still-clear example, but if you prefer higher
> > level you're going to need more detail here.
> >
> > It assumes you 'just know' that:
> >
> > - refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt) means unlock
>
> Actually, I don't know that.  All I know is local to this function --
> that's the value we added earlier before waiting; now we need to
> subtract it since we're no longer waiting.

But why would you add/subtract VMA_LOCK_OFFSET? To, as the name suggests,
lock/unlock the VMA.

If we want to go with 'why' instead of 'what' that's useful information.

I guess you can cutely surmise that 'yes we're undoing what we did'. I don't
think it's going to hurt to explain what that is.

>
> > - err can only be set due to a fatal signal in a non-uninterruptible task
> >   mode
>
> The comment says that in the first five words!

You didn't say that err can only be non-zero because of a fatal signal in
rcuwait_wait_event()?

You said 'we got a fatal signal'. I had to go dig into that code to see where
that came from...

It's super succinct. That's cute, sure. But it's not clear.

>
> > - spurious readers can cause an incremented reference count
>
> I don't know what a "spurious reader" is.  There was a reader when we
> started waiting.  Now there isn't one.

Hmm actually there are two routes here... one with real + spurious reader
refcount increments + one with only spurious. You see this is why I think
clarity is needed, there's _so much going on_.

Anyway, there are two routes to __vma_enter_locked():

__vma_start_write():
	[ mmap write lock held ]
	-> __vma_enter_locked()

In this case, you can be waiting on actual readers.

vma_mark_detached():
	[ mmap write lock, vma write lock held ]
	-> __vma_enter_locked()

In this case, as per the comment:

	/*
	 * We are the only writer, so no need to use vma_refcount_put().
	 * The condition below is unlikely because the vma has been already
	 * write-locked and readers can increment vm_refcnt only temporarily
	 * before they check vm_lock_seq, realize the vma is locked and drop
	 * back the vm_refcnt. That is a narrow window for observing a raised
	 * vm_refcnt.
	 */

So there's a narrow window in which readers 'spuriously' or you could say
temporarily increment vma->vm_refcnt before realising the seqcount indicates a
write lock and decrementing again.
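
To illustrate that window, here is a minimal userspace sketch of the reader
side. It is a model only, not the kernel code: struct model_vma, its fields and
both helper names are hypothetical, and the sequence comparison merely stands
in for the vm_lock_seq check described in the comment above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VMA fields involved; not the kernel struct. */
struct model_vma {
	atomic_int refcnt;	/* models vma->vm_refcnt */
	unsigned int lock_seq;	/* models vma->vm_lock_seq */
};

/* Step 1: the reader raises the refcount before it knows about the writer. */
static void model_reader_raise(struct model_vma *vma)
{
	atomic_fetch_add(&vma->refcnt, 1);
}

/*
 * Step 2: the reader compares sequence numbers. Equal values mean the VMA
 * is write-locked, so the reader backs out and drops its reference.
 * Returns true if the read lock was taken.
 */
static bool model_reader_recheck(struct model_vma *vma, unsigned int mm_seq)
{
	/* A reader can only get here after raising the count in step 1. */
	assert(atomic_load(&vma->refcnt) > 0);

	if (vma->lock_seq == mm_seq) {
		atomic_fetch_sub(&vma->refcnt, 1);
		return false;
	}
	return true;
}
```

Between the two steps the count is raised even though the reader never ends up
holding the lock, which is the 'spurious' increment a detaching writer can
observe.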

The _only way_ we encounter the issue you are writing defensive code against
here is:

- There was 1 spurious or non-spurious VMA reader.

- A fatal signal arose (assuming nobody ever goes and changes
  rcuwait_wait_event() to add more errors - very likely, not entirely certain
  though, so perhaps 'an error that meant we couldn't wait'.)

- Despite the waiter aborting, the readers finished.

- The VMA is detached (i.e. vma->vm_refcnt is 0, otherwise
  refcount_sub_and_test() wouldn't return true).

- Setting err = 0 indicates that we are now resolving this by treating the VMA
  as detached.
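
The sequence above can be sketched as a small userspace model. This is a
sketch under stated assumptions, not the patch itself: struct model_vma and the
model_* helpers are hypothetical names, MODEL_LOCK_OFFSET stands in for
VMA_LOCK_OFFSET with an assumed value, and -EINTR is assumed to be the error
the aborted wait reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for VMA_LOCK_OFFSET; the value here is an assumption. */
#define MODEL_LOCK_OFFSET 0x40000000

/* Hypothetical stand-in for the VMA; only the refcount is modelled. */
struct model_vma {
	atomic_int refcnt;	/* models vma->vm_refcnt */
};

/* The writer announces itself by adding the lock offset. */
static void model_writer_enter(struct model_vma *vma)
{
	atomic_fetch_add(&vma->refcnt, MODEL_LOCK_OFFSET);
}

/* Like refcount_sub_and_test(): true if the count dropped to zero. */
static bool model_sub_and_test(struct model_vma *vma, int n)
{
	return atomic_fetch_sub(&vma->refcnt, n) == n;
}

/*
 * Error path after an aborted wait. Undo the writer's offset; if that
 * takes the count to zero, the last reader left while we were bailing
 * out, so report success (the VMA is detached) instead of the error.
 */
static int model_abort_wait(struct model_vma *vma, int err)
{
	/* The writer's offset must still be in place at this point. */
	assert(atomic_load(&vma->refcnt) >= MODEL_LOCK_OFFSET);

	if (model_sub_and_test(vma, MODEL_LOCK_OFFSET))
		err = 0;
	return err;
}
```

With one reader still holding a reference the error is passed through
unchanged; only when the reader has already gone does the model turn the
error into 0.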

And sure, your comment spends a couple of words on the fatal signal, mentions
that the last reader went away, and mentions detach - but does any of that help
clarify what any of this actually does?

In practice I read this comment and understood absolutely nothing. I don't think
it even provides a hint.

Without reverse engineering the whole thing I wouldn't know what this meant; it
just assumes too much.

>
> > - that a race can exist between a spuriously raised reference count and the
> >   previous reference count check between read above and refcount subtract here
> >
> > - a reference count of 0 means detached
> >
> > - err = 0 means we are treating this VMA as detached resolving this race
> >   'in favour of' the VMA being detached.
> >
> > Let's get some of this information in here please.
>
> I don't think that here is the place to document these things!  And
> certainly not in a patch that we're trying to get applied five days
> before the merge window opens.  There's plenty of time to get the

I mean I could argue being stubborn about reasonable requests rather
contradicts the rush to get this in, so that goes both ways...

...But I propose a compromise below to speed this up.

> comments and the variable names sorted out; can we focus on the right
> way to fix this bug?

Since you're concerned about the urgency, let me suggest a compromise:

	/*
	 * We tried waiting on readers, but failed, likely due to a fatal
	 * signal arising. Unlock the VMA and check whether the VMA is
	 * detached.
	 */
	if (refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt)) {
		/*
		 * The VMA is now detached, which means we lost a race.
		 * Let the caller know the VMA is detached.
		 */
		err = 0;
	}

That gives a _lot_ more information, keeps it relatively top-level, doesn't
make undue assumptions etc.

It also puts the broader comment about what you're doing in the right
place, and makes the 'weird thing that should never happen' comment more
specific.

>
> > Again I think we'd be better off with at least a VM_WARN_ON_ONCE() given
> > this is a rather obscure corner case.
>
> Are you satisfied with the WARN_ON(!detaching)?
>

It'd be super weird to reach that code when not detaching, so sure. I think it
should be VM_WARN_ON() though, since the code would already be horribly broken
if that was not the case, no?

For the sake of not dragging this out any longer, I guess we can do without
the broader WARN_ON() or leave it until later.

But that comment needs updating.

Thanks, Lorenzo

