* [PATCH v2 0/2] vma_start_write_killable
From: Matthew Wilcox (Oracle) @ 2025-11-10 20:32 UTC
To: Andrew Morton
Cc: Matthew Wilcox (Oracle), linux-mm, Suren Baghdasaryan, Chris Li,
    Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Shakeel Butt,
    Jann Horn, Pedro Falcato

When we added the VMA lock, we made a major oversight in not adding a
killable variant.  That can run us into trouble where a thread takes the
VMA lock for read (eg handling a page fault) and then goes out to lunch
for an hour (eg doing reclaim).  Another thread tries to modify the VMA,
taking the mmap_lock for write, then attempts to lock the VMA for write.
That blocks on the first thread, and ensures that every other page fault
now tries to take the mmap_lock for read.  Because everything's in an
uninterruptible sleep, we can't kill the task, which makes me angry.

This patch set just adds vma_start_write_killable() and converts one
caller to use it.  Most users are somewhat tricky to convert, so expect
follow-up individual patches per call-site which need careful analysis
to make sure we've done proper cleanup.

v2:
 - Document the return value from __vma_start_write() (Suren)
 - Call rwsem_release() on failure to make lockdep happy
 - Reformat the prototype for vma_start_write_killable() to make both
   Suren & me happy
 - Add the kernel-doc for vma_start_write_killable() to
   Documentation/mm/process_addrs.rst
 - Collect R-b from Suren & Liam (thanks!)

Matthew Wilcox (Oracle) (2):
  mm: Add vma_start_write_killable()
  mm: Use vma_start_write_killable() in dup_mmap()

 Documentation/mm/process_addrs.rst |  9 +++++++-
 include/linux/mmap_lock.h          | 30 ++++++++++++++++++++++++--
 mm/mmap.c                          | 12 +++--------
 mm/mmap_lock.c                     | 34 ++++++++++++++++++++++--------
 tools/testing/vma/vma_internal.h   |  8 +++++++
 5 files changed, 72 insertions(+), 21 deletions(-)

--
2.47.2
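(The caller-side shape this series moves towards, condensed into a sketch:
the write-side VMA lock acquisition can now fail with -EINTR instead of
sleeping uninterruptibly.  This is illustrative only; "frob_vmas" is a
made-up name, and the real conversion in this series is dup_mmap() in
patch 2/2.)

	static int frob_vmas(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;
		VMA_ITERATOR(vmi, mm, 0);
		int err = 0;

		mmap_write_lock(mm);
		for_each_vma(vmi, vma) {
			err = vma_start_write_killable(vma);
			if (err < 0)	/* -EINTR: fatal signal pending */
				break;
			/* ... modify the vma ... */
		}
		/* dropping the mmap write lock releases all VMA write locks */
		mmap_write_unlock(mm);
		return err;
	}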
* [PATCH v2 1/2] mm: Add vma_start_write_killable()
From: Matthew Wilcox (Oracle) @ 2025-11-10 20:32 UTC
To: Andrew Morton
Cc: Matthew Wilcox (Oracle), linux-mm, Suren Baghdasaryan, Chris Li,
    Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Shakeel Butt,
    Jann Horn, Pedro Falcato

The vma can be held read-locked for a substantial period of time, eg if
memory allocation needs to go into reclaim.  It's useful to be able to
send fatal signals to threads which are waiting for the write lock.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 Documentation/mm/process_addrs.rst |  9 +++++++-
 include/linux/mmap_lock.h          | 30 ++++++++++++++++++++++++--
 mm/mmap_lock.c                     | 34 ++++++++++++++++++++++--------
 tools/testing/vma/vma_internal.h   |  8 +++++++
 4 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index be49e2a269e4..7f2f3e87071d 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -48,7 +48,8 @@ Terminology
 * **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
   as a read/write semaphore in practice. A VMA read lock is obtained via
   :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
-  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
+  write lock via vma_start_write() or vma_start_write_killable()
+  (all VMA write locks are unlocked
   automatically when the mmap write lock is released). To take a VMA write lock
   you **must** have already acquired an :c:func:`!mmap_write_lock`.
 * **rmap locks** - When trying to access VMAs through the reverse mapping via a
@@ -907,3 +908,9 @@ Stack expansion
 Stack expansion throws up additional complexities in that we cannot permit there
 to be racing page faults, as a result we invoke :c:func:`!vma_start_write` to
 prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
+
+------------------------
+Functions and structures
+------------------------
+
+.. kernel-doc:: include/linux/mmap_lock.h
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 2c9fffa58714..378dfb9f1335 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -195,7 +195,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
-void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
+int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
+		int state);
 
 /*
  * Begin writing to a VMA.
@@ -209,7 +210,30 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	__vma_start_write(vma, mm_lock_seq);
+	__vma_start_write(vma, mm_lock_seq, TASK_UNINTERRUPTIBLE);
+}
+
+/**
+ * vma_start_write_killable - Begin writing to a VMA.
+ * @vma: The VMA we are going to modify.
+ *
+ * Exclude concurrent readers under the per-VMA lock until the currently
+ * write-locked mmap_lock is dropped or downgraded.
+ *
+ * Context: May sleep while waiting for readers to drop the vma read lock.
+ * Caller must already hold the mmap_lock for write.
+ *
+ * Return: 0 for a successful acquisition.  -EINTR if a fatal signal was
+ * received.
+ */
+static inline __must_check
+int vma_start_write_killable(struct vm_area_struct *vma)
+{
+	unsigned int mm_lock_seq;
+
+	if (__is_vma_write_locked(vma, &mm_lock_seq))
+		return 0;
+	return __vma_start_write(vma, mm_lock_seq, TASK_KILLABLE);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -286,6 +310,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
 { return NULL; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
+static inline __must_check
+int vma_start_write_killable(struct vm_area_struct *vma) { return 0; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 { mmap_assert_write_locked(vma->vm_mm); }
 static inline void vma_assert_attached(struct vm_area_struct *vma) {}
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 0a0db5849b8e..39f341caf32c 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -45,8 +45,15 @@ EXPORT_SYMBOL(__mmap_lock_do_trace_released);
 
 #ifdef CONFIG_MMU
 #ifdef CONFIG_PER_VMA_LOCK
-static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
+/*
+ * Return value: 0 if vma detached,
+ *		 1 if vma attached with no readers,
+ *		 -EINTR if signal received,
+ */
+static inline int __vma_enter_locked(struct vm_area_struct *vma,
+		bool detaching, int state)
 {
+	int err;
 	unsigned int tgt_refcnt = VMA_LOCK_OFFSET;
 
 	/* Additional refcnt if the vma is attached. */
@@ -58,15 +65,19 @@ static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching
 	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
 	 */
 	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
-		return false;
+		return 0;
 
 	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
-	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+	err = rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
 		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
-		   TASK_UNINTERRUPTIBLE);
+		   state);
+	if (err) {
+		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+		return err;
+	}
 	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
 
-	return true;
+	return 1;
 }
 
 static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
@@ -75,16 +86,19 @@ static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
 	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
 }
 
-void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
+		int state)
 {
-	bool locked;
+	int locked;
 
 	/*
 	 * __vma_enter_locked() returns false immediately if the vma is not
 	 * attached, otherwise it waits until refcnt is indicating that vma
 	 * is attached with no readers.
 	 */
-	locked = __vma_enter_locked(vma, false);
+	locked = __vma_enter_locked(vma, false, state);
+	if (locked < 0)
+		return locked;
 
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
@@ -100,6 +114,8 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 		__vma_exit_locked(vma, &detached);
 		WARN_ON_ONCE(detached); /* vma should remain attached */
 	}
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(__vma_start_write);
 
@@ -118,7 +134,7 @@ void vma_mark_detached(struct vm_area_struct *vma)
 	 */
 	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
 		/* Wait until vma is detached with no readers. */
-		if (__vma_enter_locked(vma, true)) {
+		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
 			bool detached;
 
 			__vma_exit_locked(vma, &detached);
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index dc976a285ad2..917062cfbc69 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -844,6 +844,14 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	vma->vm_lock_seq++;
 }
 
+static inline __must_check
+int vma_start_write_killable(struct vm_area_struct *vma)
+{
+	/* Used to indicate to tests that a write operation has begun. */
+	vma->vm_lock_seq++;
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
--
2.47.2
* Re: [PATCH v2 1/2] mm: Add vma_start_write_killable()
From: Vlastimil Babka @ 2025-11-11 8:58 UTC
To: Matthew Wilcox (Oracle), Andrew Morton
Cc: linux-mm, Suren Baghdasaryan, Chris Li, Liam R. Howlett,
    Lorenzo Stoakes, Shakeel Butt, Jann Horn, Pedro Falcato

On 11/10/25 21:32, Matthew Wilcox (Oracle) wrote:
> The vma can be held read-locked for a substantial period of time, eg if
> memory allocation needs to go into reclaim.  It's useful to be able to
> send fatal signals to threads which are waiting for the write lock.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH v2 1/2] mm: Add vma_start_write_killable()
From: Lorenzo Stoakes @ 2025-11-13 13:20 UTC
To: Matthew Wilcox (Oracle)
Cc: Andrew Morton, linux-mm, Suren Baghdasaryan, Chris Li, Liam R. Howlett,
    Vlastimil Babka, Shakeel Butt, Jann Horn, Pedro Falcato

On Mon, Nov 10, 2025 at 08:32:01PM +0000, Matthew Wilcox (Oracle) wrote:
> The vma can be held read-locked for a substantial period of time, eg if
> memory allocation needs to go into reclaim.  It's useful to be able to
> send fatal signals to threads which are waiting for the write lock.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Some nits and comments below but generally logic LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  Documentation/mm/process_addrs.rst |  9 +++++++-
>  include/linux/mmap_lock.h          | 30 ++++++++++++++++++++++++--
>  mm/mmap_lock.c                     | 34 ++++++++++++++++++++++--------
>  tools/testing/vma/vma_internal.h   |  8 +++++++
>  4 files changed, 69 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index be49e2a269e4..7f2f3e87071d 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -48,7 +48,8 @@ Terminology
>  * **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
>    as a read/write semaphore in practice. A VMA read lock is obtained via
>    :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> -  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> +  write lock via vma_start_write() or vma_start_write_killable()
> +  (all VMA write locks are unlocked

Thanks for updating!

>    automatically when the mmap write lock is released). To take a VMA write lock
>    you **must** have already acquired an :c:func:`!mmap_write_lock`.
>  * **rmap locks** - When trying to access VMAs through the reverse mapping via a
> @@ -907,3 +908,9 @@ Stack expansion
>  Stack expansion throws up additional complexities in that we cannot permit there
>  to be racing page faults, as a result we invoke :c:func:`!vma_start_write` to
>  prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
> +
> +------------------------
> +Functions and structures
> +------------------------
> +
> +.. kernel-doc:: include/linux/mmap_lock.h

Ah nice to add this now we've got everything squared away into the right
header :)

> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 2c9fffa58714..378dfb9f1335 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -195,7 +195,8 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
>  	return (vma->vm_lock_seq == *mm_lock_seq);
>  }
>
> -void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
> +int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
> +		int state);
>
>  /*
>   * Begin writing to a VMA.
> @@ -209,7 +210,30 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	if (__is_vma_write_locked(vma, &mm_lock_seq))
>  		return;
>
> -	__vma_start_write(vma, mm_lock_seq);
> +	__vma_start_write(vma, mm_lock_seq, TASK_UNINTERRUPTIBLE);
> +}
> +
> +/**
> + * vma_start_write_killable - Begin writing to a VMA.
> + * @vma: The VMA we are going to modify.
> + *
> + * Exclude concurrent readers under the per-VMA lock until the currently
> + * write-locked mmap_lock is dropped or downgraded.
> + *
> + * Context: May sleep while waiting for readers to drop the vma read lock.
> + * Caller must already hold the mmap_lock for write.
> + *
> + * Return: 0 for a successful acquisition.  -EINTR if a fatal signal was
> + * received.
> + */

Not relevant to this series but probably we should write kdoc's for all of
these...

> +static inline __must_check
> +int vma_start_write_killable(struct vm_area_struct *vma)
> +{
> +	unsigned int mm_lock_seq;
> +

Wonder if an mmap write lock assert is appropriate here. But we can defer
that as we don't do that with vma_start_write() either...

> +	if (__is_vma_write_locked(vma, &mm_lock_seq))
> +		return 0;
> +	return __vma_start_write(vma, mm_lock_seq, TASK_KILLABLE);
>  }
>
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> @@ -286,6 +310,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
>  { return NULL; }
>  static inline void vma_end_read(struct vm_area_struct *vma) {}
>  static inline void vma_start_write(struct vm_area_struct *vma) {}
> +static inline __must_check
> +int vma_start_write_killable(struct vm_area_struct *vma) { return 0; }
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>  { mmap_assert_write_locked(vma->vm_mm); }
>  static inline void vma_assert_attached(struct vm_area_struct *vma) {}
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index 0a0db5849b8e..39f341caf32c 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -45,8 +45,15 @@ EXPORT_SYMBOL(__mmap_lock_do_trace_released);
>
>  #ifdef CONFIG_MMU
>  #ifdef CONFIG_PER_VMA_LOCK
> -static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
> +/*
> + * Return value: 0 if vma detached,
> + *		 1 if vma attached with no readers,
> + *		 -EINTR if signal received,
> + */
> +static inline int __vma_enter_locked(struct vm_area_struct *vma,
> +		bool detaching, int state)
>  {
> +	int err;
>  	unsigned int tgt_refcnt = VMA_LOCK_OFFSET;
>
>  	/* Additional refcnt if the vma is attached. */
> @@ -58,15 +65,19 @@ static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching
>  	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
>  	 */
>  	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> -		return false;
> +		return 0;
>
>  	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> -	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> +	err = rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
>  		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> -		   TASK_UNINTERRUPTIBLE);
> +		   state);
> +	if (err) {
> +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +		return err;
> +	}
>  	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
>
> -	return true;
> +	return 1;
>  }
>
>  static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> @@ -75,16 +86,19 @@ static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
>  	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>  }
>
> -void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> +int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
> +		int state)
>  {
> -	bool locked;
> +	int locked;
>
>  	/*
>  	 * __vma_enter_locked() returns false immediately if the vma is not
>  	 * attached, otherwise it waits until refcnt is indicating that vma
>  	 * is attached with no readers.

This comment seems to need updating?

>  	 */
> -	locked = __vma_enter_locked(vma, false);
> +	locked = __vma_enter_locked(vma, false, state);

NIT, but while we're here maybe we could make this:

locked = __vma_enter_locked(vma, /*detaching=*/false, state);

> +	if (locked < 0)
> +		return locked;
>
>  	/*
>  	 * We should use WRITE_ONCE() here because we can have concurrent reads
> @@ -100,6 +114,8 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
>  		__vma_exit_locked(vma, &detached);
>  		WARN_ON_ONCE(detached); /* vma should remain attached */
>  	}
> +
> +	return 0;
>  }
>  EXPORT_SYMBOL_GPL(__vma_start_write);
>
> @@ -118,7 +134,7 @@ void vma_mark_detached(struct vm_area_struct *vma)
>  	 */
>  	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
>  		/* Wait until vma is detached with no readers. */
> -		if (__vma_enter_locked(vma, true)) {
> +		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {

No issues with us waiting uninterruptibly here?

I guess rare condition and we require this to happen so makes sense.

>  			bool detached;
>
>  			__vma_exit_locked(vma, &detached);
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index dc976a285ad2..917062cfbc69 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -844,6 +844,14 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	vma->vm_lock_seq++;
>  }
>
> +static inline __must_check
> +int vma_start_write_killable(struct vm_area_struct *vma)
> +{
> +	/* Used to indicate to tests that a write operation has begun. */
> +	vma->vm_lock_seq++;
> +	return 0;
> +}

Thanks for updating. A stub is fine here.

> +
>  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  					 unsigned long start,
>  					 unsigned long end,
> --
> 2.47.2
>
* Re: [PATCH v2 1/2] mm: Add vma_start_write_killable()
From: Matthew Wilcox @ 2025-11-13 14:40 UTC
To: Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, Suren Baghdasaryan, Chris Li, Liam R. Howlett,
    Vlastimil Babka, Shakeel Butt, Jann Horn, Pedro Falcato

On Thu, Nov 13, 2025 at 01:20:14PM +0000, Lorenzo Stoakes wrote:
> > +/**
> > + * vma_start_write_killable - Begin writing to a VMA.
> > + * @vma: The VMA we are going to modify.
> > + *
> > + * Exclude concurrent readers under the per-VMA lock until the currently
> > + * write-locked mmap_lock is dropped or downgraded.
> > + *
> > + * Context: May sleep while waiting for readers to drop the vma read lock.
> > + * Caller must already hold the mmap_lock for write.
> > + *
> > + * Return: 0 for a successful acquisition.  -EINTR if a fatal signal was
> > + * received.
> > + */
>
> Not relevant to this series but probably we should write kdoc's for all of
> these...

Yes.  I was mildly miffed that I couldn't crib from an existing one ;-)
Definitely worth a followup patch at some point to add kdoc for all of
them.

> > +static inline __must_check
> > +int vma_start_write_killable(struct vm_area_struct *vma)
> > +{
> > +	unsigned int mm_lock_seq;
> > +
>
> Wonder if an mmap write lock assert is appropriate here. But we can defer
> that as we don't do that with vma_start_write() either...

Seems like a reasonable addition in a later patch.  As you say, it
wasn't there in vma_start_write(), so I didn't like to add it.

> > +int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
> > +		int state)
> >  {
> > -	bool locked;
> > +	int locked;
> >
> >  	/*
> >  	 * __vma_enter_locked() returns false immediately if the vma is not
> >  	 * attached, otherwise it waits until refcnt is indicating that vma
> >  	 * is attached with no readers.
>
> This comment seems to need updating?

Honestly, I'd rather remove it.  It's weird to document what the
function you're calling does.  If there's anything worth saving,
put it in the comment before __vma_enter_locked().

> >  	 */
> > -	locked = __vma_enter_locked(vma, false);
> > +	locked = __vma_enter_locked(vma, false, state);
>
> NIT, but while we're here maybe we could make this:
>
> locked = __vma_enter_locked(vma, /*detaching=*/false, state);

Ah yes, the canonical 'don't pass bool arguments to functions' problem.

#define VMA_DETACHING		true
#define VMA_NOT_DETACHING	false

> > @@ -118,7 +134,7 @@ void vma_mark_detached(struct vm_area_struct *vma)
> >  	 */
> >  	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> >  		/* Wait until vma is detached with no readers. */
> > -		if (__vma_enter_locked(vma, true)) {
> > +		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
>
> No issues with us waiting uninterruptibly here?
>
> I guess rare condition and we require this to happen so makes sense.

I'd defer to you on that analysis.  If we want to add a
vma_mark_detached_killable(), we can do that.  If we want to make all
callers of vma_mark_detached() check its return value, we can do that
too (see earlier patch attempts you were cc'd on off-list).  It really
requires someone to do the analysis of whether it's worthwhile.
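(A rough sketch of how that suggestion would read at the two
__vma_enter_locked() call sites, using the define names proposed above;
this is purely illustrative and not part of the posted series:)

	#define VMA_DETACHING		true
	#define VMA_NOT_DETACHING	false

	/* in __vma_start_write(): */
	locked = __vma_enter_locked(vma, VMA_NOT_DETACHING, state);

	/* in vma_mark_detached(): */
	if (__vma_enter_locked(vma, VMA_DETACHING, TASK_UNINTERRUPTIBLE)) {
		/* wait for readers, then tear down as before */
	}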
* Re: [PATCH v2 1/2] mm: Add vma_start_write_killable()
From: Suren Baghdasaryan @ 2025-11-17 19:50 UTC
To: Matthew Wilcox
Cc: Lorenzo Stoakes, Andrew Morton, linux-mm, Chris Li, Liam R. Howlett,
    Vlastimil Babka, Shakeel Butt, Jann Horn, Pedro Falcato

On Thu, Nov 13, 2025 at 6:40 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 13, 2025 at 01:20:14PM +0000, Lorenzo Stoakes wrote:
> > > +/**
> > > + * vma_start_write_killable - Begin writing to a VMA.
> > > + * @vma: The VMA we are going to modify.
> > > + *
> > > + * Exclude concurrent readers under the per-VMA lock until the currently
> > > + * write-locked mmap_lock is dropped or downgraded.
> > > + *
> > > + * Context: May sleep while waiting for readers to drop the vma read lock.
> > > + * Caller must already hold the mmap_lock for write.
> > > + *
> > > + * Return: 0 for a successful acquisition.  -EINTR if a fatal signal was
> > > + * received.
> > > + */
> >
> > Not relevant to this series but probably we should write kdoc's for all of
> > these...
>
> Yes.  I was mildly miffed that I couldn't crib from an existing one ;-)
> Definitely worth a followup patch at some point to add kdoc for all of
> them.
>
> > > +static inline __must_check
> > > +int vma_start_write_killable(struct vm_area_struct *vma)
> > > +{
> > > +	unsigned int mm_lock_seq;
> > > +
> >
> > Wonder if an mmap write lock assert is appropriate here. But we can defer
> > that as we don't do that with vma_start_write() either...
>
> Seems like a reasonable addition in a later patch.  As you say, it
> wasn't there in vma_start_write(), so I didn't like to add it.
>
> > > +int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
> > > +		int state)
> > >  {
> > > -	bool locked;
> > > +	int locked;
> > >
> > >  	/*
> > >  	 * __vma_enter_locked() returns false immediately if the vma is not
> > >  	 * attached, otherwise it waits until refcnt is indicating that vma
> > >  	 * is attached with no readers.
> >
> > This comment seems to need updating?
>
> Honestly, I'd rather remove it.  It's weird to document what the
> function you're calling does.  If there's anything worth saving,
> put it in the comment before __vma_enter_locked().

Either removing or moving it above __vma_enter_locked() sounds good to me.

> > >  	 */
> > > -	locked = __vma_enter_locked(vma, false);
> > > +	locked = __vma_enter_locked(vma, false, state);
> >
> > NIT, but while we're here maybe we could make this:
> >
> > locked = __vma_enter_locked(vma, /*detaching=*/false, state);
>
> Ah yes, the canonical 'don't pass bool arguments to functions' problem.
>
> #define VMA_DETACHING		true
> #define VMA_NOT_DETACHING	false
>
> > > @@ -118,7 +134,7 @@ void vma_mark_detached(struct vm_area_struct *vma)
> > >  	 */
> > >  	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> > >  		/* Wait until vma is detached with no readers. */
> > > -		if (__vma_enter_locked(vma, true)) {
> > > +		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
> >
> > No issues with us waiting uninterruptibly here?
> >
> > I guess rare condition and we require this to happen so makes sense.
>
> I'd defer to you on that analysis.  If we want to add a
> vma_mark_detached_killable(), we can do that.  If we want to make all
> callers of vma_mark_detached() check its return value, we can do that
> too (see earlier patch attempts you were cc'd on off-list).  It really
> requires someone to do the analysis of whether it's worthwhile.

I analyzed this when Matthew proposed his first patch and I think waiting
here uninterruptibly is fine.  The only time the writer might be blocked
here is when a reader temporarily increments vm_refcnt to check
vm_lock_seq, realizes the vma is locked and drops the vm_refcnt back.
That is an unlikely case and when it happens the reader should drop the
refcount and unblock the writer very quickly.

>
* [PATCH v2 2/2] mm: Use vma_start_write_killable() in dup_mmap()
From: Matthew Wilcox (Oracle) @ 2025-11-10 20:32 UTC
To: Andrew Morton
Cc: Matthew Wilcox (Oracle), linux-mm, Suren Baghdasaryan, Chris Li,
    Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Shakeel Butt,
    Jann Horn, Pedro Falcato

Allow waiting for the VMA write lock to be interrupted by fatal signals.
The explicit check for fatal_signal_pending() can be removed as it
is checked during vma_start_write_killable().  Improves the latency
of killing the task as we do not wait for the reader to finish before
checking for signals.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 5fd3b80fda1d..03773fe777b7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1750,7 +1750,9 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	for_each_vma(vmi, mpnt) {
 		struct file *file;
 
-		vma_start_write(mpnt);
+		retval = vma_start_write_killable(mpnt);
+		if (retval < 0)
+			goto loop_out;
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
 						    mpnt->vm_end, GFP_KERNEL);
@@ -1761,14 +1763,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			continue;
 		}
 		charge = 0;
-		/*
-		 * Don't duplicate many vmas if we've been oom-killed (for
-		 * example)
-		 */
-		if (fatal_signal_pending(current)) {
-			retval = -EINTR;
-			goto loop_out;
-		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
 
--
2.47.2
* Re: [PATCH v2 2/2] mm: Use vma_start_write_killable() in dup_mmap()
From: Vlastimil Babka @ 2025-11-11 9:16 UTC
To: Matthew Wilcox (Oracle), Andrew Morton
Cc: linux-mm, Suren Baghdasaryan, Chris Li, Liam R. Howlett,
    Lorenzo Stoakes, Shakeel Butt, Jann Horn, Pedro Falcato

On 11/10/25 21:32, Matthew Wilcox (Oracle) wrote:
> Allow waiting for the VMA write lock to be interrupted by fatal signals.
> The explicit check for fatal_signal_pending() can be removed as it
> is checked during vma_start_write_killable().  Improves the latency
> of killing the task as we do not wait for the reader to finish before
> checking for signals.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/mmap.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5fd3b80fda1d..03773fe777b7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1750,7 +1750,9 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  	for_each_vma(vmi, mpnt) {
>  		struct file *file;
>
> -		vma_start_write(mpnt);
> +		retval = vma_start_write_killable(mpnt);
> +		if (retval < 0)
> +			goto loop_out;
>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>  			retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
>  						    mpnt->vm_end, GFP_KERNEL);
> @@ -1761,14 +1763,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  			continue;
>  		}
>  		charge = 0;
> -		/*
> -		 * Don't duplicate many vmas if we've been oom-killed (for
> -		 * example)
> -		 */
> -		if (fatal_signal_pending(current)) {
> -			retval = -EINTR;
> -			goto loop_out;
> -		}
>  		if (mpnt->vm_flags & VM_ACCOUNT) {
>  			unsigned long len = vma_pages(mpnt);
>
* Re: [PATCH v2 2/2] mm: Use vma_start_write_killable() in dup_mmap()
From: Lorenzo Stoakes @ 2025-11-13 13:25 UTC
To: Matthew Wilcox (Oracle)
Cc: Andrew Morton, linux-mm, Suren Baghdasaryan, Chris Li, Liam R. Howlett,
    Vlastimil Babka, Shakeel Butt, Jann Horn, Pedro Falcato

On Mon, Nov 10, 2025 at 08:32:02PM +0000, Matthew Wilcox (Oracle) wrote:
> Allow waiting for the VMA write lock to be interrupted by fatal signals.
> The explicit check for fatal_signal_pending() can be removed as it
> is checked during vma_start_write_killable().  Improves the latency
> of killing the task as we do not wait for the reader to finish before
> checking for signals.

Does this fix Chris's issues or is a forthcoming series required for that?
I'm guessing not, as his issue was w.r.t. reclaim right rather than at the
point of forking?

If it does we should say so I think. If not then obviously not!

>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  mm/mmap.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5fd3b80fda1d..03773fe777b7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1750,7 +1750,9 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  	for_each_vma(vmi, mpnt) {
>  		struct file *file;
>
> -		vma_start_write(mpnt);
> +		retval = vma_start_write_killable(mpnt);

Maybe worth explicitly adding a comment to say we handle fatal signals
here? I guess it is implied but still maybe useful?

> +		if (retval < 0)
> +			goto loop_out;
>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>  			retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
>  						    mpnt->vm_end, GFP_KERNEL);
> @@ -1761,14 +1763,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  			continue;
>  		}
>  		charge = 0;
> -		/*
> -		 * Don't duplicate many vmas if we've been oom-killed (for
> -		 * example)
> -		 */
> -		if (fatal_signal_pending(current)) {
> -			retval = -EINTR;
> -			goto loop_out;
> -		}
>  		if (mpnt->vm_flags & VM_ACCOUNT) {
>  			unsigned long len = vma_pages(mpnt);
>
> --
> 2.47.2
>
* Re: [PATCH v2 2/2] mm: Use vma_start_write_killable() in dup_mmap()
From: Matthew Wilcox @ 2025-11-13 14:30 UTC
To: Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, Suren Baghdasaryan, Chris Li, Liam R. Howlett,
    Vlastimil Babka, Shakeel Butt, Jann Horn, Pedro Falcato

On Thu, Nov 13, 2025 at 01:25:40PM +0000, Lorenzo Stoakes wrote:
> On Mon, Nov 10, 2025 at 08:32:02PM +0000, Matthew Wilcox (Oracle) wrote:
> > Allow waiting for the VMA write lock to be interrupted by fatal signals.
> > The explicit check for fatal_signal_pending() can be removed as it
> > is checked during vma_start_write_killable().  Improves the latency
> > of killing the task as we do not wait for the reader to finish before
> > checking for signals.
>
> Does this fix Chris's issues or is a forthcoming series required for that?
> I'm guessing not, as his issue was w.r.t. reclaim right rather than at the
> point of forking?
>
> If it does we should say so I think. If not then obviously not!

No, it won't fix Chris's issue.  I just wanted to have one user in tree
(which was a good idea because syzbot promptly caught the lockdep bug).
This one is an obvious one which already has a good fallback path.
Most of the other places need a fallback path written, which requires
more time and attention than I can devote to the task right now.

> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> > -		vma_start_write(mpnt);
> > +		retval = vma_start_write_killable(mpnt);
>
> Maybe worth explicitly adding a comment to say we handle fatal signals
> here? I guess it is implied but still maybe useful?

I think that would be over-commenting, tbh.
Thread overview: 10+ messages

2025-11-10 20:32 [PATCH v2 0/2] vma_start_write_killable  Matthew Wilcox (Oracle)
2025-11-10 20:32 ` [PATCH v2 1/2] mm: Add vma_start_write_killable()  Matthew Wilcox (Oracle)
2025-11-11  8:58   ` Vlastimil Babka
2025-11-13 13:20   ` Lorenzo Stoakes
2025-11-13 14:40     ` Matthew Wilcox
2025-11-17 19:50       ` Suren Baghdasaryan
2025-11-10 20:32 ` [PATCH v2 2/2] mm: Use vma_start_write_killable() in dup_mmap()  Matthew Wilcox (Oracle)
2025-11-11  9:16   ` Vlastimil Babka
2025-11-13 13:25   ` Lorenzo Stoakes
2025-11-13 14:30     ` Matthew Wilcox