* [PATCH v2 0/2] add and use vma_assert_stabilised() helper
@ 2026-01-19 20:59 Lorenzo Stoakes
  2026-01-19 20:59 ` [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication Lorenzo Stoakes
  2026-01-19 20:59 ` [PATCH v2 2/2] mm: add vma_assert_stabilised() Lorenzo Stoakes
  0 siblings, 2 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2026-01-19 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm,
      linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon,
      Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
      Steven Rostedt

Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot be
changed underneath us. This will be the case if EITHER the VMA lock or the
mmap lock is held.

We already open-code this in two places - anon_vma_name() in mm/madvise.c and
vma_flag_set_atomic() in include/linux/mm.h. This series adds
vma_assert_stabilised(), which abstracts this check (sketched below for
reference) and can be used at these call sites instead.

This implementation uses lockdep where possible - that is, for VMA read locks -
which correctly track read lock acquisition/release via:

  vma_start_read()
    -> rwsem_acquire_read()

  vma_start_read_locked()
    -> vma_start_read_locked_nested()
      -> rwsem_acquire_read()

And:

  vma_end_read()
    -> vma_refcount_put()
      -> rwsem_release()

We don't track VMA write locks using lockdep, however these are predicated
upon mmap write locks whose lockdep state we do track, and additionally
vma_assert_stabilised() asserts the mmap lock if the VMA read lock is not
held, so we get lockdep coverage in this case also.

We also add vma_is_read_locked() and update vma_assert_locked() to perform
lockdep tracking for VMA read locks.

We also add extensive comments to describe what we're doing. There's some
tricky stuff around mmap locking and stabilisation races that we have to be
careful of, which I describe in the patch introducing vma_assert_stabilised().

This change also lays the foundation for future series to add this assert in
further places where we wish to make it clear that we rely upon a stabilised
VMA. The motivation for this change was precisely this.

v2:
* Added lockdep as much as possible to the mix as per Peter and Sebastian.
* Added comments to make clear what we're doing in each case.
* I realise I made a mistake in saying the previous duplicative VMA stable
  asserts were wrong - vma_assert_locked() is not a no-op if
  !CONFIG_PER_VMA_LOCK, instead it degrades to asserting that the mmap lock
  is held, so this is correct, though it means we'd have checked this twice,
  only triggering an assert the second time.
* Accounted for the is_vma_writer_only() case in vma_is_read_locked().
* Accounted for two hideous issues - we cannot check the VMA lock first,
  because we may be holding a VMA write lock and be raced by VMA readers of
  _other_ VMAs. If we check the mmap lock first and assert, we may hold a VMA
  read lock and race other threads which hold the mmap read lock and fail an
  assert. We resolve this with a precise mmap ownership check if lockdep is
  used, and allow the check to be approximate if there is no lockdep.
* Added more comments and updated commit logs.
* Dropped Suren's Suggested-by as there are significant changes in this set
  (this was for the vma_is_read_locked() concept).
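For reference, the duplicated pattern the new helper replaces looks like this
at both existing call sites (per the diff in patch 2 - either lock suffices to
keep the VMA stable):

	/* mmap read lock/VMA read lock must be held. */
	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
		vma_assert_locked(vma);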
v1: https://lore.kernel.org/all/cover.1768569863.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (2):
  mm/vma: use lockdep where we can, reduce duplication
  mm: add vma_assert_stabilised()

 include/linux/mm.h        |  4 +-
 include/linux/mmap_lock.h | 88 +++++++++++++++++++++++++++++++++++++--
 mm/madvise.c              |  4 +-
 3 files changed, 87 insertions(+), 9 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication
  2026-01-19 20:59 [PATCH v2 0/2] add and use vma_assert_stabilised() helper Lorenzo Stoakes
@ 2026-01-19 20:59 ` Lorenzo Stoakes
  2026-01-20 13:53   ` Vlastimil Babka
  2026-01-19 20:59 ` [PATCH v2 2/2] mm: add vma_assert_stabilised() Lorenzo Stoakes
  1 sibling, 1 reply; 7+ messages in thread
From: Lorenzo Stoakes @ 2026-01-19 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm,
      linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon,
      Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
      Steven Rostedt

We introduce vma_is_read_locked(), which must deal with the case in which a
VMA write lock sets the refcount to VMA_LOCK_OFFSET or VMA_LOCK_OFFSET + 1.
Luckily is_vma_writer_only() already exists, which we can use to check this.

We then try to make vma_assert_locked() use lockdep as far as we can.

Unfortunately the VMA lock implementation does not even try to track VMA
write locks using lockdep, so we cannot track the lock this way.

This is less egregious than it might seem, as VMA write locks are predicated
on mmap write locks, which we do lockdep-assert.

vma_assert_write_locked() already asserts that the mmap write lock is taken,
so we get that checked implicitly.

However for read locks we do indeed use lockdep, via rwsem_acquire_read()
called in vma_start_read() and rwsem_release() called in vma_refcount_put(),
which is in turn called by vma_end_read().

Therefore we perform a lockdep assertion if the VMA is known to be
read-locked.

If it is write-locked, we assert the mmap lock instead, with a lockdep check
if lockdep is enabled.

If lockdep is not enabled, we just check that locks are in place.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mmap_lock.h | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index b50416fbba20..6979222882f1 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -236,6 +236,13 @@ int vma_start_write_killable(struct vm_area_struct *vma)
 	return __vma_start_write(vma, mm_lock_seq, TASK_KILLABLE);
 }
 
+static inline bool vma_is_read_locked(const struct vm_area_struct *vma)
+{
+	const unsigned int refcnt = refcount_read(&vma->vm_refcnt);
+
+	return refcnt > 1 && !is_vma_writer_only(refcnt);
+}
+
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 	unsigned int mm_lock_seq;
@@ -243,12 +250,31 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 	VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma);
 }
 
+/**
+ * vma_assert_locked() - Assert that @vma is either read or write locked and
+ * that we have ownership of that lock (if lockdep is enabled).
+ * @vma: The VMA we assert.
+ *
+ * If lockdep is enabled, we ensure ownership of the VMA lock. Otherwise we
+ * assert that we are VMA write-locked, which implicitly asserts that we hold
+ * the mmap write lock.
+ */
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	unsigned int mm_lock_seq;
-
-	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
-		      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
+	/*
+	 * VMA locks currently only utilise lockdep for read locks, as
+	 * vma_end_write_all() releases an unknown number of VMA write locks
+	 * and we don't currently walk the maple tree to identify which locks
+	 * are released even under CONFIG_LOCKDEP.
+	 *
+	 * However, VMA write locks are predicated on an mmap write lock, which
+	 * we DO track under lockdep, and which vma_assert_write_locked()
+	 * asserts.
+	 */
+	if (vma_is_read_locked(vma))
+		lockdep_assert(lock_is_held(&vma->vmlock_dep_map));
+	else
+		vma_assert_write_locked(vma);
 }
 
 static inline bool vma_is_attached(struct vm_area_struct *vma)
-- 
2.52.0

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication 2026-01-19 20:59 ` [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication Lorenzo Stoakes @ 2026-01-20 13:53 ` Vlastimil Babka 2026-01-20 17:49 ` Lorenzo Stoakes 0 siblings, 1 reply; 7+ messages in thread From: Vlastimil Babka @ 2026-01-20 13:53 UTC (permalink / raw) To: Lorenzo Stoakes, Andrew Morton Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm, linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt
On 1/19/26 21:59, Lorenzo Stoakes wrote: > We introduce vma_is_read_locked(), which must deal with the case in which > VMA write lock sets refcnt to VMA_LOCK_OFFSET or VMA_LOCK_OFFSET + > 1. Luckily is_vma_writer_only() already exists which we can use to check > this.
So I think there's a bit of a caveat in that
- is_vma_writer_only() may be a false positive if there is a temporary reader of a detached vma (per comments in vma_mark_detached() and vma_mark_detached())
- hence vma_is_read_locked() may be a false negative
- hence vma_assert_locked() might assume wrongly that we should not assert being a reader, so we call vma_assert_write_locked() instead, and fail
However the above should mean it could be only us who is the temporary reader. And we are not going to use vma_assert_locked() during the temporary reader part (in vma_start_read()). So it's probably fine, but maybe worth some comments to prevent people getting suspicious and reconstructing this?
But I think perhaps also vma_assert_locked() could, with lockdep enabled (similarly to vma_assert_stabilised() in patch 2), use the "lock_is_held(&vma->vmlock_dep_map)" condition (without immediately asserting it) for the primary reader vs writer decision, and not rely on vma_is_read_locked()? Because lockdep has the precise information. It would likely make things more ugly, or require more refactoring, but hopefully worthwhile?
> We then try to make vma_assert_locked() use lockdep as far as we can. > > Unfortunately the VMA lock implementation does not even try to track VMA > write locks using lockdep, so we cannot track the lock this way. > > This is less egregious than it might seem as VMA write locks are predicated > on mmap write locks, which we do lockdep assert. > > vma_assert_write_locked() already asserts the mmap write lock is taken so > we get that checked implicitly. > However for read locks we do indeed use lockdup, via rwsem_acquire_read() > called in vma_start_read() and rwsem_release_read() called in > vma_refcount_put() called in turn by vma_end_read(). > > Therefore we perform a lockdep assertion if the VMA is known to be > read-locked. > > If it is write-locked, we assert the mmap lock instead, with a lockdep > check if lockdep is enabled. > > If lockdep is not enabled, we just check that locks are in place.
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > include/linux/mmap_lock.h | 34 ++++++++++++++++++++++++++++++---- > 1 file changed, 30 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h > index b50416fbba20..6979222882f1 100644 > --- a/include/linux/mmap_lock.h > +++ b/include/linux/mmap_lock.h > @@ -236,6 +236,13 @@ int vma_start_write_killable(struct vm_area_struct *vma) > return __vma_start_write(vma, mm_lock_seq, TASK_KILLABLE); > } > > +static inline bool vma_is_read_locked(const struct vm_area_struct *vma) > +{ > + const unsigned int refcnt = refcount_read(&vma->vm_refcnt); > + > + return refcnt > 1 && !is_vma_writer_only(refcnt); > +} > + > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > { > unsigned int mm_lock_seq; > @@ -243,12 +250,31 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) > VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma); > } > > +/** > + * vma_assert_locked() - Assert that @vma is either read or write locked and > + * that we have ownership of that lock (if lockdep is enabled). > + * @vma: The VMA we assert. > + * > + * If lockdep is enabled, we ensure ownership of the VMA lock. Otherwise we > + * assert that we are VMA write-locked, which implicitly asserts that we hold > + * the mmap write lock. > + */ > static inline void vma_assert_locked(struct vm_area_struct *vma) > { > - unsigned int mm_lock_seq; > - > - VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 && > - !__is_vma_write_locked(vma, &mm_lock_seq), vma); > + /* > + * VMA locks currently only utilise lockdep for read locks, as > + * vma_end_write_all() releases an unknown number of VMA write locks and > + * we don't currently walk the maple tree to identify which locks are > + * released even under CONFIG_LOCKDEP. > + * > + * However, VMA write locks are predicated on an mmap write lock, which > + * we DO track under lockdep, and which vma_assert_write_locked() > + * asserts. > + */ > + if (vma_is_read_locked(vma)) > + lockdep_assert(lock_is_held(&vma->vmlock_dep_map)); > + else > + vma_assert_write_locked(vma); > } > > static inline bool vma_is_attached(struct vm_area_struct *vma) ^ permalink raw reply [flat|nested] 7+ messages in thread
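A minimal sketch of the lockdep-first idea suggested above (illustrative only,
assuming CONFIG_PER_VMA_LOCK and CONFIG_LOCKDEP; not a posted patch - a
!CONFIG_LOCKDEP build would still need some refcount-based fallback):

	static inline void vma_assert_locked(struct vm_area_struct *vma)
	{
		/*
		 * lockdep tracks VMA *read* locks precisely, so ask it whether
		 * we hold one; if not, we must be the writer, which implies
		 * (and is asserted via) the mmap write lock.
		 */
		if (lock_is_held(&vma->vmlock_dep_map))
			return;
		vma_assert_write_locked(vma);
	}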
* Re: [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication 2026-01-20 13:53 ` Vlastimil Babka @ 2026-01-20 17:49 ` Lorenzo Stoakes 2026-01-20 21:28 ` Vlastimil Babka 0 siblings, 1 reply; 7+ messages in thread From: Lorenzo Stoakes @ 2026-01-20 17:49 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm, linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt
On Tue, Jan 20, 2026 at 02:53:30PM +0100, Vlastimil Babka wrote: > On 1/19/26 21:59, Lorenzo Stoakes wrote: > > We introduce vma_is_read_locked(), which must deal with the case in which > > VMA write lock sets refcnt to VMA_LOCK_OFFSET or VMA_LOCK_OFFSET + > > 1. Luckily is_vma_writer_only() already exists which we can use to check > > this. > > So I think there's a bit of a caveat in that > > - is_vma_writer_only() may be a false positive if there is a temporary > reader of a detached vma (per comments in vma_mark_detached() and > vma_mark_detached())
vma_mark_detached() and vma_mark_attached() I assume you mean.
OK so this is all very confusing indeed. This function is _only_ referring to the situation between __vma_enter_locked() and __vma_exit_locked(). Despite their names, suggesting maybe they happen on lock and unlock respectively, that's not the case - they're both invoked on lock, with 'enter' on the start of TAKING the lock and 'exit' on completion of TAKING the lock.
It seems __vma_enter_locked() is more about getting into this state with refcnt equal to VMA_LOCK_OFFSET (detaching - note elsewhere we say detached of course) or VMA_LOCK_OFFSET + 1 if attached, and waiting on readers who are spuriously increasing the reference count.
So fundamentally is_vma_writer_only() is actually asking 'are we in the midst of a VMA write lock acquisition having finally set the VMA's refcnt to VMA_LOCK_OFFSET or VMA_LOCK_OFFSET+1 but haven't yet completed acquiring the lock' - e.g. having not yet called __vma_exit_locked().
With __vma_enter_locked() called from: vma_start_write() / vma_start_write_killable() -> __vma_start_write() -> __vma_enter_locked() vma_mark_detached() -> __vma_enter_locked()
OK so in __vma_enter_locked() we add VMA_LOCK_OFFSET but then wait until we get to either VMA_LOCK_OFFSET + 1 (attached) or VMA_LOCK_OFFSET (detached), since presumably refcnt == 0 is detached, refcnt == 1 means write lock finally acquired (but you have to check the sequence number). And _there_ we could have spurious readers.
> > - hence vma_is_read_locked() may be a false negative
Yup.
> > - hence vma_assert_locked() might assume wrongly that we should not assert > being a reader, so we vma_assert_write_locked() instead, and fail
Aside -> Every time I come to this code it's like this - having to refresh my memory as to how any of it works, getting confused, etc. This speaks to this being a broken abstraction similar to anon_vma. What I mean by leaked abstraction is that you seem to need to maintain the _implementation_ context in your head to be able to correctly implement anything. We simply are not abstracting details here really well at all. The fact I got this wrong despite staring at this code for ages is indicative of that. Also the fact an ostensibly simple series has turned into a 'restore the context' discussion this long and taken this many hours is further suggestive. I think we can do better.
I'd rather not do more 'cleanup' series, but I think this badly needs it. So maybe I'll convert this series into something that addresses some of this stuff. <- Aside
> > Howevever the above should mean it could be only us who is the temporary > reader. And we are not going to use vma_assert_locked() during the temporary > reader part (in vma_start_read()).
I don't think so? Spurious readers can arise at any time incrementing the refcnt via vma_start_read(), so the temporary readers could be anybody. But they'd not get the read lock, and so shouldn't call anything asserting the read lock.
Anyway I think my use of is_vma_writer_only() is just broken then. We _do_ need to account for the VMA write lock scenario here, but I think instead we should be good with refcnt > 1 && refcnt < VMA_LOCK_OFFSET no?
> So it's probably fine, but maybe worth some comments to prevent people > getting suspicious and reconstructing this?
But yeah, we shouldn't be asserting this anywhere where this would be the case. So hopefully the above resolves the issue? It's still racey (write lock might have been acquired since we checked, but that's just the nature of it). But if we use lockdep as you mention we can actually do the same 'precise if lockdep, otherwise not so precise' approach in the stabilised check. Let me respin.
> > But I think perhaps also vma_assert_locked() could, with lockdep enabled > (similarly to vma_assert_stabilised() in patch 2), use the > "lock_is_held(&vma->vmlock_dep_map)" condition (without immediately > asserting it) for the primary reader vs writer decision, and not rely on > vma_is_read_locked()? Because lockdep has the precise information. > > It would likely make things more ugly, or require more refactoring, but > hopefully worthwhile?
Yeah good idea. Will do. Maybe I need to make this into a broader refactoring series. Because this so badly needs it. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 7+ messages in thread
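For concreteness, the check being floated above would look something like this
(an untested sketch of the idea only, not part of the posted series - the
follow-up below explains why it still isn't a reliable signal):

	static inline bool vma_is_read_locked(const struct vm_area_struct *vma)
	{
		const unsigned int refcnt = refcount_read(&vma->vm_refcnt);

		/*
		 * Attached is 1, each reader adds 1, and any writer in the
		 * midst of locking pushes this to >= VMA_LOCK_OFFSET.
		 */
		return refcnt > 1 && refcnt < VMA_LOCK_OFFSET;
	}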
* Re: [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication 2026-01-20 17:49 ` Lorenzo Stoakes @ 2026-01-20 21:28 ` Vlastimil Babka 2026-01-21 9:07 ` Lorenzo Stoakes 0 siblings, 1 reply; 7+ messages in thread From: Vlastimil Babka @ 2026-01-20 21:28 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm, linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt On 1/20/26 18:49, Lorenzo Stoakes wrote: > On Tue, Jan 20, 2026 at 02:53:30PM +0100, Vlastimil Babka wrote: >> On 1/19/26 21:59, Lorenzo Stoakes wrote: >> > We introduce vma_is_read_locked(), which must deal with the case in which >> > VMA write lock sets refcnt to VMA_LOCK_OFFSET or VMA_LOCK_OFFSET + >> > 1. Luckily is_vma_writer_only() already exists which we can use to check >> > this. >> >> So I think there's a bit of a caveat in that >> >> - is_vma_writer_only() may be a false positive if there is a temporary >> reader of a detached vma (per comments in vma_mark_detached() and >> vma_mark_detached()) > > vma_mark_detached() and vma_mark_attached() I sasume you mean. Right. > OK so this is all very confusing indeed. > > This function is _only_ referring to the situation between > __vma_enter_locked() and __vma_exit_locked(). > > Despite their names, suggesting maybe they happen on lock and unlock > respectively, that's not the case, they're both invoked on lock and enter > on start of TAKING lock and exit on completion of TAKING the > lock. IIUC yes, for the vma "write lock". > It seems __vma_enter_locked() is more about getting into this state with > refcnt equal to VMA_LOCK_OFFSET (detaching - note elsewhere we say detached > of course) or VMA_LOCK_OFFSET + 1 if attached and waiting on readers who > are spuriously increasing the reference count. Yes. > So fundamentally is_vma_writer_only() is actually asking 'are we in the > midst of a VMA write lock acquisiton having finally set the VMA's refcnt to > VMA_LOCK_OFFSET or VMA_LOCK_OFFSET+1 but haven't yet completed acquiring > the lock' - e.g. having not yet called __vma_exit_locked(). IIUC in the current code it's not used in that "are *we* in the midst..." sense but "is there a writer in that phase that we are supposed to wake up because we are the last reader", where a rare false positive answer only results in an unnecessary wakeup of said wanna-be writer, but nothing worse. And AFAIU this patch tries to reuse the function to ask "is the vma read locked?" (and we presume it's by us). > With __vma_enter_locked() called from: > > vma_start_write() / vma_start_write_killable() > -> __vma_start_write() > -> __vma_enter_locked() > > vma_mark_detached() > -> __vma_enter_locked() > > OK so in __vma_enter_locked() we add VMA_LOCK_OFFSET but then wait until we > get to either VMA_LOCK_OFFSET + 1 (attached) or VMA_LOCK_OFFSET (detached), > since presumably refcnt == 0 is detached, refcnt == 1 means write lock > finally acquired (but you have to check the sequence number). > > And _there_ we could have spurious readers. Yes. > >> >> - hence vma_is_read_locked() may be a false negative > > Yup. 
> >> >> - hence vma_assert_locked() might assume wrongly that we should not assert >> being a reader, so we vma_assert_write_locked() instead, and fail > > Aside -> > > Every time I come to this code it's like this - having to refresh > my memory as to how any of it works, getting confused, etc. > > This speaks to this being a broken abstraction similar to anon_vma. > > What I mean by leaked abstraction is that you seem to need to > maintain the _implementation_ context in your head to be able to > correctly implement anything. We simply are not abstracting details > here really well at all.
I think this would be true if it was applied to users of the high-level API of the code - actual locking and unlocking for read/write. Do they have to care about the implementation details? Hopefully not. Here we have to think about the implementation because we are trying to improve the API (to add assertions) so that's not surprising? If your complaint is about an "intermediate" abstraction level like the "is_vma_writer_only()" function then yeah it's far from perfect.
> The fact I got this wrong despite staring at this code for ages is > indicative of that. > > Also the fact an ostensibly simple series has turned into a > 'restore the context' discussion this long and taken this many > hours is further suggestive. > > I think we can do better. I'd rather not do more 'cleanup' series, > but I think this badly needs it. > > So maybe I'll convert this series into something that addresses > some of this stuff. > > <- Aside > > >> >> Howevever the above should mean it could be only us who is the temporary >> reader. And we are not going to use vma_assert_locked() during the temporary >> reader part (in vma_start_read()). > > I don't think so? Spurious readers can arise at any time incrementing the > refcnt via vma_start_read(), so the temporary readers could be anybody. > > But they'd not get the read lock, and so shouldn't call anything asserting > read lock. > > Anyway I think my use of is_vma_writer_only() is just broken then.
AFAIU the whole thing (vma_assert_locked() after this patch) would be broken in a case where we are really a reader and vma_is_read_locked() returns false wrongly, and thus makes us perform vma_assert_write_locked() and fail. So the scenarios with spurious readers can't cause that I think.
What I think could cause that is there being a writer (not us), causing is_vma_writer_only() to return true even when there's also a reader (us). And I concluded that could happen only in a case where we would be the spurious reader racing with a detaching writer. But when we are in the temporary spurious reader situation, we don't perform vma_is_read_locked() there.
> We _do_ need to account for the VMA write lock scenario here, but I think > instead we should be good with refcnt > 1 && refcnt < VMA_LOCK_OFFSET no?
That check would tell us there is no writer. But there might be a writer (not us) while we're a reader, and thus that check won't work as a signal for "we must have the read lock"?
>> So it's probably fine, but maybe worth some comments to prevent people >> getting suspicious and reconstructing this? > > But yeah we shouldn't be asserting this anywhere during which this should > be the case. > > So hopefully the above resolves the issue?
I don't follow, but perhaps I misunderstood you above and it's late here now.
> It's still racey (write lock might have been acquired since we checked but > that's just the nature of it.
> > But if we use lockdep as you mention we can actually do the same 'precise > if lockdep, otherwise not so precise' approach in the stabilised check. > > Let me respin. > >> >> But I think perhaps also vma_assert_locked() could, with lockdep enabled >> (similarly to vma_assert_stabilised() in patch 2), use the >> "lock_is_held(&vma->vmlock_dep_map)" condition (without immediately >> asserting it) for the primary reader vs writer decision, and not rely on >> vma_is_read_locked()? Because lockdep has the precise information. >> >> It would likely make things more ugly, or require more refactoring, but >> hopefully worthwhile? > > Yeah good idea. Will do. Ack, thanks. > Maybe I need to make this into a broader refactoring series. Because this > so badly needs it. Yep :/ > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 7+ messages in thread
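To make that failure mode concrete (illustrative numbers only, following the
refcount scheme as described earlier in the thread):

	attached, not locked by anyone:           vm_refcnt == 1
	we take a VMA read lock:                  vm_refcnt == 2
	another thread begins a write lock
	(__vma_enter_locked() adds VMA_LOCK_OFFSET
	and waits for readers to drain):          vm_refcnt == VMA_LOCK_OFFSET + 2

At that point "refcnt > 1 && refcnt < VMA_LOCK_OFFSET" is false even though we
really do hold the read lock, so an assert based on it would wrongly take the
write-locked path and fail.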
* Re: [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication 2026-01-20 21:28 ` Vlastimil Babka @ 2026-01-21 9:07 ` Lorenzo Stoakes 0 siblings, 0 replies; 7+ messages in thread From: Lorenzo Stoakes @ 2026-01-21 9:07 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm, linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt On Tue, Jan 20, 2026 at 10:28:15PM +0100, Vlastimil Babka wrote: > On 1/20/26 18:49, Lorenzo Stoakes wrote: > > On Tue, Jan 20, 2026 at 02:53:30PM +0100, Vlastimil Babka wrote: > >> On 1/19/26 21:59, Lorenzo Stoakes wrote: > >> > We introduce vma_is_read_locked(), which must deal with the case in which > >> > VMA write lock sets refcnt to VMA_LOCK_OFFSET or VMA_LOCK_OFFSET + > >> > 1. Luckily is_vma_writer_only() already exists which we can use to check > >> > this. > >> > >> So I think there's a bit of a caveat in that > >> > >> - is_vma_writer_only() may be a false positive if there is a temporary > >> reader of a detached vma (per comments in vma_mark_detached() and > >> vma_mark_detached()) > > > > vma_mark_detached() and vma_mark_attached() I sasume you mean. > > Right. > > > OK so this is all very confusing indeed. > > > > This function is _only_ referring to the situation between > > __vma_enter_locked() and __vma_exit_locked(). > > > > Despite their names, suggesting maybe they happen on lock and unlock > > respectively, that's not the case, they're both invoked on lock and enter > > on start of TAKING lock and exit on completion of TAKING the > > lock. > > IIUC yes, for the vma "write lock". Yes. > > > It seems __vma_enter_locked() is more about getting into this state with > > refcnt equal to VMA_LOCK_OFFSET (detaching - note elsewhere we say detached > > of course) or VMA_LOCK_OFFSET + 1 if attached and waiting on readers who > > are spuriously increasing the reference count. > > Yes. > > > So fundamentally is_vma_writer_only() is actually asking 'are we in the > > midst of a VMA write lock acquisiton having finally set the VMA's refcnt to > > VMA_LOCK_OFFSET or VMA_LOCK_OFFSET+1 but haven't yet completed acquiring > > the lock' - e.g. having not yet called __vma_exit_locked(). > > IIUC in the current code it's not used in that "are *we* in the midst..." > sense but "is there a writer in that phase that we are supposed to wake up > because we are the last reader", where a rare false positive answer only > results in an unnecessary wakeup of said wanna-be writer, but nothing worse. Yeah sorry I phrased that badly, 'is there a writer in the midst of...', not us of course. And yes this seems to be the case. > > And AFAIU this patch tries to reuse the function to ask "is the vma read > locked?" (and we presume it's by us). > > > With __vma_enter_locked() called from: > > > > vma_start_write() / vma_start_write_killable() > > -> __vma_start_write() > > -> __vma_enter_locked() > > > > vma_mark_detached() > > -> __vma_enter_locked() > > > > OK so in __vma_enter_locked() we add VMA_LOCK_OFFSET but then wait until we > > get to either VMA_LOCK_OFFSET + 1 (attached) or VMA_LOCK_OFFSET (detached), > > since presumably refcnt == 0 is detached, refcnt == 1 means write lock > > finally acquired (but you have to check the sequence number). > > > > And _there_ we could have spurious readers. > > Yes. 
> > > > >> > >> - hence vma_is_read_locked() may be a false negative > > > > Yup. > > > >> > >> - hence vma_assert_locked() might assume wrongly that we should not assert > >> being a reader, so we vma_assert_write_locked() instead, and fail > > > > Aside -> > > > > Every time I come to this code it's like this - having to refresh > > my memory as to how any of it works, getting confused, etc. > > > > This speaks to this being a broken abstraction similar to anon_vma. > > > > What I mean by leaked abstraction is that you seem to need to > > maintain the _implementation_ context in your head to be able to > > correctly implement anything. We simply are not abstracting details > > here really well at all. > > I think this would be true if it was applied to users of the high-level API > of the code - actual locking and unlocking for read/write. Do they have to > care about the implementation details? Hopefully not. No I disagree, a broken abstraction applies to maintenance too. We have to consider _everything at once_ even trying to do change that are relevant only to one part of the mechanism. It's not the case in other parts of mm (apart from anon_vma) that I need to remind myself of -the entirety of an incredibly complicated self-rolled locking mechanism- to do _anything at all_. > > Here we have to think about the implementation because we are trying to > improve the API (to add assertions) so that's not surprising? If your Err what?^W^W OK just read the 'intermediate abstraction' bit below - yes this is what I mean :) not the public API which is relatively OK, I mean the intermediate levels which are very much not ;) I'm trying to add a very basic and simple assertion of 'is lock A or lock B' taken. And _just look_ at how difficult it's been. This isn't a big change. This isn't a fundamental change. It's an absolutely minor change, and frankly something that should have been in place from the start. I decided to add it to be a good kernel citizen (I have a MILLION things to do) having (actually it turns out incorrectly) felt that the hard-coded version of this was incorrect as well as wanting to be able to assert this fundamental state in those places that we need it (very many). After the feedback from Peter + Sebastian it seemed obviously sensible to try to use lockdep as much as possible. And thus the voyage to the lands of insanity began... As always, no good deed goes unpunished :) It's a hallmark of code that is not well abstracted that you have to disentangle the _whole thing_ to be able to do simple things. It's also a hallmark of code that I feel could do with being simplified and more clearly documented. So in the respin I'll do this. The point I'm making really is there are _levels_ of abstraction both in the public API _and_ the internal implementation. The public API abstraction is generally reasonably OK. The truly broken abstraction is in the layers below. > complaing is about an "intermediate" abstractions level like the > "is_vma_writer_only()" function then yeah it's far from perfect. Right yes. This haha. > > > The fact I got this wrong despite staring at this code for ages is > > indicative of that. > > > > Also the fact an ostensibly simple series has turned into a > > 'restore the context' discussion this long and taken this many > > hours is further suggestive. > > > > I think we can do better. I'd rather not do more 'cleanup' series, > > but I think this badly needs it. > > > > So maybe I'll convert this series into something that addresses > > some of this stuff. 
> > > > <- Aside > > > > > >> > >> Howevever the above should mean it could be only us who is the temporary > >> reader. And we are not going to use vma_assert_locked() during the temporary > >> reader part (in vma_start_read()). > > > > I don't think so? Spurious readers can arise at any time incrementing the > > refcnt via vma_start_read(), so the temporary readers could be anybody. > > > > But they'd not get the read lock, and so shouldn't call anything asserting > > read lock. > > > > Anyway I think my use of is_vma_writer_only() is just broken then. > > AFAIU the whole thing (vma_assert_locked() after this patch) would be broken > in a case where we are really a reader and vma_is_read_locked() returns > false wrongly, and thus makes us perform vma_assert_write_locked() and fail. > So the scenarios with spurious readers can't cause that I think. > > What I think could cause that is there being a writer (not us), causing > is_vma_writer_only() return true even when there's also a reader (us). And I > concluded that could happen only in case where we would be the spurious > reader racing with a detaching writer. But when we are in the temporary > spurious reader situation, we don't perform vma_is_read_locked() there. > > > We _do_ need to account for the VMA write lock scenario here, but I think > > instead we should be good with refcnt > 1 && refcnt < VMA_LOCK_OFFSET no? > > That check would tell us there is no writer. But there might be a writer > (not us) while we're a reader, and thus that check won't work as a signal > for "we must have the read lock"?
Would it? A writer lock implies refcnt = 1 (or 0 if detached) right? If another writer/detacher is in the midst of taking the lock then it'd be >= VMA_LOCK_OFFSET.
We might actually encounter an issue where another thread's vma_start_read() happens to pass the lockless check, then increments refcnt before realising it is write-locked and decrementing, but we get just the wrong timing and observe refcnt > 1 && < VMA_LOCK_OFFSET, but that's very unlikely.
Anyway the conclusion is we can't have vma_is_read_locked() at all without lockdep. Fun! :) I want to assert the right thing, so I guess we'll end up with: if (lock_is_held(...)) lockdep_assert(lock_is_held(...));
> > >> So it's probably fine, but maybe worth some comments to prevent people > >> getting suspicious and reconstructing this? > > > > But yeah we shouldn't be asserting this anywhere during which this should > be the case. > > > > So hopefully the above resolves the issue? > > I don't follow, but perhaps I misunderstood you above and it's late here now.
Well I think we've reached some point of zen now...
> > It's still racey (write lock might have been acquired since we checked but > > that's just the nature of it. > > > > But if we use lockdep as you mention we can actually do the same 'precise > > if lockdep, otherwise not so precise' approach in the stabilised check. > > > > Let me respin. > > > >> > >> But I think perhaps also vma_assert_locked() could, with lockdep enabled > >> (similarly to vma_assert_stabilised() in patch 2), use the > >> "lock_is_held(&vma->vmlock_dep_map)" condition (without immediately > >> asserting it) for the primary reader vs writer decision, and not rely on > >> vma_is_read_locked()? Because lockdep has the precise information. > >> > >> It would likely make things more ugly, or require more refactoring, but > >> hopefully worthwhile? > > > > Yeah good idea. Will do. > > Ack, thanks. > > > Maybe I need to make this into a broader refactoring series.
Because this > > so badly needs it. > > Yep :/ > > > Cheers, Lorenzo > Back to a respin before I forget how all this works (again)... Cheers, Lorenzo ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2 2/2] mm: add vma_assert_stabilised()
  2026-01-19 20:59 [PATCH v2 0/2] add and use vma_assert_stabilised() helper Lorenzo Stoakes
  2026-01-19 20:59 ` [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication Lorenzo Stoakes
@ 2026-01-19 20:59 ` Lorenzo Stoakes
  1 sibling, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2026-01-19 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Jann Horn, linux-mm,
      linux-kernel, linux-rt-devel, Peter Zijlstra, Ingo Molnar, Will Deacon,
      Boqun Feng, Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
      Steven Rostedt

Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot be
changed underneath us. This will be the case if EITHER the VMA lock or the
mmap lock is held.

In order to do so, we introduce a new assert, vma_assert_stabilised() - this
will make a lockdep assert if lockdep is enabled AND the VMA is read-locked.

Currently lockdep tracking for VMA write locks is not implemented, so it
suffices to check in this case that we have either an mmap read or write
semaphore held.

Note that because the VMA lock uses the non-standard vmlock_dep_map naming
convention, we cannot use lockdep_assert_is_write_held(), so we have to
open-code this ourselves via lockdep-asserting that
lock_is_held_type(&vma->vmlock_dep_map, 0).

We have to be careful here - for instance when merging a VMA, we use the mmap
write lock to stabilise the examination of adjacent VMAs which might be
simultaneously VMA read-locked whilst being faulted in. If we were to assert
the VMA read lock using lockdep we would encounter an incorrect lockdep
assert.

Also, we have to be careful about asserting that mmap locks are held - if we
try to address the above issue by first checking whether the mmap lock is
held and if so asserting it via lockdep, we may find that we were raced by
another thread acquiring an mmap read lock simultaneously that either we
don't own (and thus can be released any time - so we are not stable) or was
indeed released since we last checked.

So to deal with these complexities we end up with either a precise (if
lockdep is enabled) or imprecise (if not) approach - in the first instance we
check, via lockdep, whether the mmap lock is held and thus whether we own it.
If we do own it, then the check is complete, otherwise we must check for the
VMA read lock being held (a VMA write lock implies the mmap write lock, so
the mmap lock check suffices for this).

If lockdep is not enabled we simply check if the mmap lock is held and risk a
false positive. We add vma_assert_read_locked() for this case.

There are a couple of places in the kernel where we already do this
stabilisation check - the anon_vma_name() helper in mm/madvise.c and
vma_flag_set_atomic() in include/linux/mm.h, which we update to use
vma_assert_stabilised().

This change abstracts these into vma_assert_stabilised(), uses lockdep if
possible, and avoids a duplicate check of whether the mmap lock is held.

This is also self-documenting and lays the foundations for further VMA
stability checks in the code.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h        |  4 +-
 include/linux/mmap_lock.h | 56 +++++++++++++++++++++++++++++++++++++++
 mm/madvise.c              |  4 +--
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a18ade628c8e..4c0104a21d0b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1008,9 +1008,7 @@ static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
 {
 	unsigned long *bitmap = ACCESS_PRIVATE(&vma->flags, __vma_flags);
 
-	/* mmap read lock/VMA read lock must be held. */
-	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
-		vma_assert_locked(vma);
+	vma_assert_stabilised(vma);
 
 	if (__vma_flag_atomic_valid(vma, bit))
 		set_bit((__force int)bit, bitmap);
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 6979222882f1..17e5aa12586e 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -277,6 +277,56 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
 		vma_assert_write_locked(vma);
 }
 
+/**
+ * vma_assert_read_locked() - Asserts that @vma is specifically read locked.
+ * @vma: The VMA we assert.
+ */
+static inline void vma_assert_read_locked(struct vm_area_struct *vma)
+{
+	lockdep_assert(lock_is_held(&vma->vmlock_dep_map));
+	VM_BUG_ON_VMA(!vma_is_read_locked(vma), vma);
+}
+
+/**
+ * vma_assert_stabilised() - assert that this VMA cannot be changed from
+ * underneath us either by having a VMA or mmap lock held.
+ * @vma: The VMA whose stability we wish to assess.
+ *
+ * If lockdep is enabled we can precisely ensure stability via either an mmap
+ * lock owned by us or a specific VMA lock.
+ *
+ * With lockdep disabled we may sometimes race with other threads acquiring
+ * the mmap read lock simultaneously with our VMA read lock.
+ */
+static inline void vma_assert_stabilised(struct vm_area_struct *vma)
+{
+	/*
+	 * We have to be careful about VMA read locks and concurrent mmap locks
+	 * by other threads. If we were to assert we own an mmap lock when in
+	 * fact it is another thread's, or if we were to race with it unlocking
+	 * when asserting an mmap lock, we will fail incorrectly.
+	 *
+	 * If we have lockdep, we can treat OUR owning the mmap lock as
+	 * sufficient stabilisation.
+	 *
+	 * If not, this is an approximation and we simply assume the same,
+	 * though sometimes we might be wrong due to races.
+	 */
+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
+		if (lockdep_is_held(&vma->vm_mm->mmap_lock))
+			return;
+	} else {
+		if (rwsem_is_locked(&vma->vm_mm->mmap_lock))
+			return;
+	}
+
+	/*
+	 * OK we must hold a VMA read lock, since a write lock requires mmap
+	 * lock.
+	 */
+	vma_assert_read_locked(vma);
+}
+
 static inline bool vma_is_attached(struct vm_area_struct *vma)
 {
 	return refcount_read(&vma->vm_refcnt);
@@ -353,6 +403,12 @@ static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	return NULL;
 }
 
+static inline void vma_assert_stabilised(struct vm_area_struct *vma)
+{
+	/* If no VMA locks, then either mmap lock suffices to stabilise. */
+	mmap_assert_locked(vma->vm_mm);
+}
+
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
diff --git a/mm/madvise.c b/mm/madvise.c
index 4bf4c8c38fd3..1f3040688f04 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -109,9 +109,7 @@ void anon_vma_name_free(struct kref *kref)
 
 struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
-		vma_assert_locked(vma);
-
+	vma_assert_stabilised(vma);
 	return vma->anon_name;
 }
 
-- 
2.52.0

^ permalink raw reply	[flat|nested] 7+ messages in thread
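As an illustration of the kind of future call site the cover letter mentions
(a hypothetical example only - the function below is made up for demonstration
and is not part of this series):

	static unsigned long example_vma_size(struct vm_area_struct *vma)
	{
		/* We read vm_start/vm_end, so the VMA must not change under us. */
		vma_assert_stabilised(vma);

		return vma->vm_end - vma->vm_start;
	}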
end of thread, other threads:[~2026-01-21 9:07 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-01-19 20:59 [PATCH v2 0/2] add and use vma_assert_stabilised() helper Lorenzo Stoakes 2026-01-19 20:59 ` [PATCH v2 1/2] mm/vma: use lockdep where we can, reduce duplication Lorenzo Stoakes 2026-01-20 13:53 ` Vlastimil Babka 2026-01-20 17:49 ` Lorenzo Stoakes 2026-01-20 21:28 ` Vlastimil Babka 2026-01-21 9:07 ` Lorenzo Stoakes 2026-01-19 20:59 ` [PATCH v2 2/2] mm: add vma_assert_stabilised() Lorenzo Stoakes