From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org,
syzbot+5b19bad23ac7f44bf8b8@syzkaller.appspotmail.com,
Suren Baghdasaryan <surenb@google.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v2] mm: fix vma_start_write_killable() signal handling
Date: Wed, 26 Nov 2025 20:33:47 +0000
Message-ID: <d8c2a266-fb2c-4374-ab14-4ac910db7ad2@lucifer.local>
In-Reply-To: <aSdYkYYAuATwYqbD@casper.infradead.org>
On Wed, Nov 26, 2025 at 07:44:17PM +0000, Matthew Wilcox wrote:
> On Wed, Nov 26, 2025 at 06:55:52PM +0000, Lorenzo Stoakes wrote:
> > > It's only "impossible" currently due to some fairly esoteric reasoning.
> > > As far as _this_ function is concerned, it's entirely possible.
> > > I don't want to leave this trap for the next person who calls
> > > __vma_enter_locked(TASK_KILLABLE).
> >
> > Calls __vma_enter_locked(TASK_KILLABLE) _when detaching_, otherwise
> > refcount will always be >0.
> >
> > So we're only looking at us changing vma_mark_detached() to use
> > TASK_KILLABLE.
> >
> > As this is such a subtle corner case I still think it warrants a
> > warning. Or at least a VM_WARN_ON_ONCE(1).
> >
> > A killable detacher is, as Vlasta points out, kind of an unwise thing to do
> > anyway right?
>
> I missed where that was said?
"Yeah I guess it's for the best to keep vma_mark_detached() use the
TASK_UNINTERRUPTIBLE variant, maybe document why. Aborting the detaching
would be counter productive."
https://lore.kernel.org/all/058f5858-f508-40f8-adfe-e5de78621d64@suse.cz/
>
> > >
> > > > > + /*
> > > > > + * We got a fatal signal, but the last reader went
> > > > > + * away as well. Resolve the race in favour of
> > > >
> > > > This is very subtle, I don't think this really explains this clearly
> > > > enough.
> > > >
> > > > Maybe put something like:
> > > >
> > > > /* Couldn't wait on readers probably due to a fatal signal, so unlock. */
> > > >
> > > > Before the refcount_sub_and_test()
> > >
> > > I think this falls into the "saying what you're doing, not why
> > > you're doing it" trap. Whereas my comment is at a higher level --
> > > there's a race where both exit conditions are true at the same time.
> > > The rcuwait_wait_event() picked one option, but we would rather resolve
> > > the race in the opposite direction.
> >
> > I find your comment unclear, and I think it's too succinct. I was trying to
> > provide the most succinct-yet-still-clear example, but if you prefer higher
> > level you're going to need more detail here.
> >
> > It assumes you 'just know' that:
> >
> > - refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt) means unlock
>
> Actually, I don't know that. All I know is local to this function --
> that's the value we added earlier before waiting; now we need to
> subtract it since we're no longer waiting.
But why would you add/subtract VMA_LOCK_OFFSET? To, as the name suggests,
lock/unlock the VMA.
If we want to go with 'why' instead of 'what' that's useful information.
I guess you can cutely surmise that 'yes we're undoing what we did'. I don't
think it's going to hurt to explain what that is.
>
> > - err can only be set due to a fatal signal in a non-uninterruptible task
> > mode
>
> The comment says that in the first five words!
You didn't say that err can only be non-zero because of a fatal signal in
rcuwait_wait_event()?
You said 'we got a fatal signal'. I had to go dig into that code to see where
that came from...
It's super succinct. That's cute, sure. But it's not clear.
>
> > - spurious readers can cause an incremented reference count
>
> I don't know what a "spurious reader" is. There was a reader when we
> started waiting. Now there isn't one.
Hmm actually there are two routes here... one with real + spurious reader
refcount increments + one with only spurious. You see this is why I think
clarity is needed, there's _so much going on_.
Anyway, there are two routes to __vma_enter_locked():

__vma_start_write():
	[ mmap write lock held ]
	-> __vma_enter_locked()

In this case, you can be waiting on actual readers.

vma_mark_detached():
	[ mmap write lock, vma write lock held ]
	-> __vma_enter_locked()

In this case, as per the comment:

	/*
	 * We are the only writer, so no need to use vma_refcount_put().
	 * The condition below is unlikely because the vma has been already
	 * write-locked and readers can increment vm_refcnt only temporarily
	 * before they check vm_lock_seq, realize the vma is locked and drop
	 * back the vm_refcnt. That is a narrow window for observing a raised
	 * vm_refcnt.
	 */
So there's a narrow window in which readers 'spuriously' or you could say
temporarily increment vma->vm_refcnt before realising the seqcount indicates a
write lock and decrementing again.
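
To make that window concrete, here is a rough userspace sketch of the reader side. It is not kernel code: struct model_vma, model_read_trylock() and the peak out-parameter are made-up names for illustration, and it ignores memory ordering and everything else the real read-lock path does. It only shows the "increment, check the sequence, drop back" shape described in the comment above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a VMA: just a refcount and a lock sequence. */
struct model_vma {
	atomic_uint refcnt;    /* 1 == attached with no readers */
	unsigned int lock_seq; /* stands in for vm_lock_seq */
};

/*
 * Model reader: take a reference first, then check whether the VMA is
 * write-locked (lock_seq matches the mm-wide sequence). If it is, drop the
 * reference again. *peak records the raised count that a concurrent writer
 * could observe inside that window.
 */
static bool model_read_trylock(struct model_vma *vma, unsigned int mm_seq,
			       unsigned int *peak)
{
	unsigned int old = atomic_fetch_add(&vma->refcnt, 1);

	/* In this model an attached VMA always holds at least one reference. */
	assert(old >= 1);
	*peak = old + 1;

	if (vma->lock_seq == mm_seq) {
		/* Write-locked: this was a 'spurious' increment, undo it. */
		atomic_fetch_sub(&vma->refcnt, 1);
		return false;
	}

	return true;
}
```

In this model the reader fails and the count ends up where it started, but anybody sampling the count in between sees it raised by one.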
The _only way_ we encounter the issue you are writing defensive code against
here is:
- There was 1 spurious or non-spurious VMA reader.
- A fatal signal arose (assuming nobody ever goes and changes
rcuwait_wait_event() to add more errors - very likely, not entirely certain
though, so perhaps 'an error that meant we couldn't wait'.)
- Despite the waiter aborting, the readers finished.
- The VMA is detached (i.e. vma->vm_refcnt is 0 or refcount_sub_and_test()
wouldn't return true)
- Setting err = 0 indicates that we are now resolving this by treating the VMA
as detached.
And sure, you mention a fatal signal, the last reader going away, and
detach - but does any of that help clarify what any of this actually does?
In practice I read this comment and understood absolutely nothing. I don't think
it even provides a hint.
Without reverse engineering the whole thing I wouldn't know what this meant; it
just assumes too much.
>
> > - that a race can exist between a spuriously raised reference count and the
> > previous reference count check between read above and refcount subtract here
> >
> > - a reference count of 0 means detached
> >
> > - err = 0 means we are treating this VMA as detached resolving this race
> > 'in favour of' the VMA being detached.
> >
> > Let's get some of this information in here please.
>
> I don't think that here is the place to document these things! And
> certainly not in a patch that we're trying to get applied five days
> before the merge window opens. There's plenty of time to get the
I mean I could argue being stubborn about reasonable requests rather
contradicts the rush to get this in, so that goes both ways...
...But I propose a compromise below to speed this up.
> comments and the variable names sorted out; can we focus on the right
> way to fix this bug?
Since you're concerned about the urgency, let me suggest a compromise:
	/*
	 * We tried waiting on readers, but failed, likely due to a fatal
	 * signal arising. Unlock the VMA and check whether the VMA is
	 * detached.
	 */
	if (refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt)) {
		/*
		 * If the VMA is now detached, we lost a race.
		 * Let the caller know the VMA is detached.
		 */
		err = 0;
	}
That gives a _lot_ more information, keeps it relatively top-level, doesn't
make undue assumptions etc.
It also puts the broader comment about what you're doing in the right
place, and makes the 'weird thing that should never happen' comment more
specific.
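
For anyone following along, the two outcomes of that subtraction can be sketched in userspace roughly as below. This is only a model under stated assumptions: MODEL_LOCK_OFFSET, model_sub_and_test() and model_abort_wait() are hypothetical names, the offset value is arbitrary, and -EINTR is assumed as the stand-in for whatever error the wait returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for VMA_LOCK_OFFSET; the value is arbitrary here. */
#define MODEL_LOCK_OFFSET 0x40000000u

/* Like refcount_sub_and_test(): true if the subtraction reached zero. */
static bool model_sub_and_test(atomic_uint *cnt, unsigned int val)
{
	return atomic_fetch_sub(cnt, val) == val;
}

/*
 * Model of the abort path: the waiter previously added MODEL_LOCK_OFFSET,
 * its wait failed with wait_err, and it now backs out.
 */
static int model_abort_wait(atomic_uint *refcnt, int wait_err)
{
	/* We can only undo the lock offset if we added it earlier. */
	assert(atomic_load(refcnt) & MODEL_LOCK_OFFSET);

	if (model_sub_and_test(refcnt, MODEL_LOCK_OFFSET)) {
		/*
		 * Count hit zero: the last reader left after the wait was
		 * aborted, so the VMA is detached. Report success.
		 */
		return 0;
	}

	/* Something still holds a reference: propagate the wait error. */
	return wait_err;
}
```

With a count of MODEL_LOCK_OFFSET + 1 (something still holding a reference) the error is propagated and the count drops back to 1; with exactly MODEL_LOCK_OFFSET the count reaches 0 and the function returns 0, which is the 'lost a race' case.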
>
> > Again I think we'd be better off with at least a VM_WARN_ON_ONCE() given
> > this is a rather obscure corner case.
>
> Are you satisfied with the WARN_ON(!detaching)?
>
It'd be super weird to reach that code when not detaching, so sure. I think it
should be VM_WARN_ON() though, since the code would already be horribly broken
if that was not the case, no?
For the sake of not dragging this out any longer, I guess we can do without
the broader WARN_ON() or leave it until later.
But that comment needs updating.
Thanks, Lorenzo