Re: [PATCH v3 2/3] mm: replace vma_start_write() with vma_start_write_killable()

linux-mm.kvack.org archive mirror

From: Suren Baghdasaryan <surenb@google.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, willy@infradead.org, david@kernel.org,
	 ziy@nvidia.com, matthew.brost@intel.com,
	joshua.hahnjy@gmail.com,  rakie.kim@sk.com, byungchul@sk.com,
	gourry@gourry.net,  ying.huang@linux.alibaba.com,
	apopple@nvidia.com,  baolin.wang@linux.alibaba.com,
	Liam.Howlett@oracle.com, npache@redhat.com,
	 ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	 lance.yang@linux.dev, vbabka@suse.cz, jannh@google.com,
	rppt@kernel.org,  mhocko@suse.com, pfalcato@suse.de,
	kees@kernel.org, maddy@linux.ibm.com,  npiggin@gmail.com,
	mpe@ellerman.id.au, chleroy@kernel.org,
	 borntraeger@linux.ibm.com, frankja@linux.ibm.com,
	imbrenda@linux.ibm.com,  hca@linux.ibm.com, gor@linux.ibm.com,
	agordeev@linux.ibm.com,  svens@linux.ibm.com,
	gerald.schaefer@linux.ibm.com, linux-mm@kvack.org,
	 linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org,
	 "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Subject: Re: [PATCH v3 2/3] mm: replace vma_start_write() with vma_start_write_killable()
Date: Tue, 3 Mar 2026 14:11:31 -0800
Message-ID: <CAJuCfpHBfhKFeWAtQo4r-ofVtO=5MvG+OToEgc2DEY+cuZDSGw@mail.gmail.com>
In-Reply-To: <74bffc7a-2b8c-40ae-ab02-cd0ced082e18@lucifer.local>

On Mon, Mar 2, 2026 at 6:53 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 25, 2026 at 11:06:08PM -0800, Suren Baghdasaryan wrote:
> > Now that we have vma_start_write_killable() we can replace most of the
> > vma_start_write() calls with it, improving reaction time to the kill
> > signal.
> >
> > There are several places which are left untouched by this patch:
> >
> > 1. free_pgtables() because function should free page tables even if a
> > fatal signal is pending.
> >
> > 2. process_vma_walk_lock(), which requires changes in its callers and
> > will be handled in the next patch.
> >
> > 3. userfaultfd code, where some paths calling vma_start_write() can
> > handle EINTR and some can't without a deeper code refactoring.
>
> Surprise surprise :))
>
> >
> > 4. mpol_rebind_mm() which is used by cpusset controller for migrations
>
> Incredibly nitty but cpusset -> cpuset?

Ack.

>
> > and operates on a remote mm. Incomplete operations here would result
> > in an inconsistent cgroup state.
> >
> > 5. vm_flags_{set|mod|clear} require refactoring that involves moving
> > vma_start_write() out of these functions and replacing it with
> > vma_assert_write_locked(), then callers of these functions should
> > lock the vma themselves using vma_start_write_killable() whenever
> > possible.
>
> This should be dealt with by my ongoing mmap_prepare, vma flags work.

That would be great! This becomes much simpler once you are done with
that refactoring.
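
For reference, the shape described in item 5 could look roughly like the
userspace sketch below. Everything in it is a hypothetical stand-in
(struct mock_vma, mock_start_write_killable(), mock_flags_set(),
mock_caller()), not the kernel API: the flag helper only asserts that the
lock is held, and the caller takes the killable lock and propagates the
failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA; not the kernel's vm_area_struct. */
struct mock_vma {
	unsigned long flags;
	bool write_locked;
};

/* Hypothetical killable lock: fails with -EINTR if a fatal signal is pending. */
static int mock_start_write_killable(struct mock_vma *vma, bool fatal_pending)
{
	if (fatal_pending)
		return -EINTR;
	vma->write_locked = true;
	return 0;
}

/* After the refactoring the helper no longer locks, it only asserts. */
static void mock_flags_set(struct mock_vma *vma, unsigned long flags)
{
	assert(vma->write_locked);
	vma->flags |= flags;
}

/* The caller owns the locking and can report -EINTR. */
int mock_caller(struct mock_vma *vma, unsigned long flags, bool fatal_pending)
{
	int err = mock_start_write_killable(vma, fatal_pending);

	if (err)
		return err;
	mock_flags_set(vma, flags);
	return 0;
}
```

The point of the split is that a void helper cannot report a failed lock,
while its caller usually can.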

>
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> # powerpc
>
> Overall I'm a little concerned about whether callers can handle -EINTR in all
> cases, have you checked? Might we cause some weirdness in userspace if a syscall
> suddenly returns -EINTR when before it didn't?

I did check the kernel users and put the patchset through AI reviews.
I haven't checked whether any of the affected syscalls fail to advertise
-EINTR as a possible error. Adding that to my todo list for the next
respin.

>
> Also maybe we should update the manpages to reflect this, as semi-useless as the
> 'possible error codes' sections are...

Ok, I'll check which syscalls need to be updated and will note that in
the cover letter. Once the patchset is accepted I'll update the manpages
for them.

>
> > ---
> >  arch/powerpc/kvm/book3s_hv_uvmem.c |  5 +-
> >  mm/khugepaged.c                    |  5 +-
> >  mm/madvise.c                       |  4 +-
> >  mm/memory.c                        |  2 +
> >  mm/mempolicy.c                     |  8 ++-
> >  mm/mlock.c                         | 21 +++++--
> >  mm/mprotect.c                      |  4 +-
> >  mm/mremap.c                        |  4 +-
> >  mm/vma.c                           | 93 +++++++++++++++++++++---------
> >  mm/vma_exec.c                      |  6 +-
> >  10 files changed, 109 insertions(+), 43 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > index 5fbb95d90e99..0a28b48a46b8 100644
> > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > @@ -410,7 +410,10 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
> >                       ret = H_STATE;
> >                       break;
> >               }
> > -             vma_start_write(vma);
> > +             if (vma_start_write_killable(vma)) {
> > +                     ret = H_STATE;
> > +                     break;
> > +             }
> >               /* Copy vm_flags to avoid partial modifications in ksm_madvise */
> >               vm_flags = vma->vm_flags;
> >               ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 1dd3cfca610d..6c92e31ee5fb 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1141,7 +1141,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > -     vma_start_write(vma);
> > +     if (vma_start_write_killable(vma)) {
> > +             result = SCAN_FAIL;
> > +             goto out_up_write;
> > +     }
> >       result = check_pmd_still_valid(mm, address, pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c0370d9b4e23..ccdaea6b3b15 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -173,7 +173,9 @@ static int madvise_update_vma(vm_flags_t new_flags,
> >       madv_behavior->vma = vma;
> >
> >       /* vm_flags is protected by the mmap_lock held in write mode. */
> > -     vma_start_write(vma);
> > +     if (vma_start_write_killable(vma))
> > +             return -EINTR;
> > +
> >       vm_flags_reset(vma, new_flags);
> >       if (set_new_anon_name)
> >               return replace_anon_vma_name(vma, anon_name);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 07778814b4a8..691062154cf5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -379,6 +379,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> >   * page tables that should be removed.  This can differ from the vma mappings on
> >   * some archs that may have mappings that need to be removed outside the vmas.
> >   * Note that the prev->vm_end and next->vm_start are often used.
> > + * We don't use vma_start_write_killable() because page tables should be freed
> > + * even if the task is being killed.
> >   *
> >   * The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
> >   * unrelated data to the mm_struct being torn down.
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 0e5175f1c767..90939f5bde02 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> >               return -EINVAL;
> >       if (end == start)
> >               return 0;
> > -     mmap_write_lock(mm);
> > +     if (mmap_write_lock_killable(mm))
> > +             return -EINTR;
>
> Hmm mmap write lock as well now :) I guess it makes sense here, esp given mmap
> write lock part of VMA write lock.

Yeah, I thought while we are at it we can make this part a bit better too...

>
>
> >       prev = vma_prev(&vmi);
> >       for_each_vma_range(vmi, vma, end) {
> >               /*
> > @@ -1801,13 +1802,16 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> >                       err = -EOPNOTSUPP;
> >                       break;
> >               }
> > +             if (vma_start_write_killable(vma)) {
> > +                     err = -EINTR;
> > +                     break;
> > +             }
> >               new = mpol_dup(old);
> >               if (IS_ERR(new)) {
> >                       err = PTR_ERR(new);
> >                       break;
> >               }
> >
> > -             vma_start_write(vma);
>
> Are we ok with moving this to before mpol_dup()? Does this matter? Confused as
> to why you moved this up?

I thought if locking fails, it would be better to check that earlier
before allocating a new mempolicy. That seems to be safe, but thinking
about this now, if allocation goes into direct reclaim and causes the
lock to be held for longer then that might not be such a hot idea...

>
> >               new->home_node = home_node;
> >               err = mbind_range(&vmi, vma, &prev, start, end, new);
> >               mpol_put(new);
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index 2f699c3497a5..c562c77c3ee0 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -420,7 +420,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> >   * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
> >   * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
> >   */
>
> You should update the comment to reflect this possible return value.

Ack.

>
> > -static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > +static int mlock_vma_pages_range(struct vm_area_struct *vma,
> >       unsigned long start, unsigned long end, vm_flags_t newflags)
> >  {
> >       static const struct mm_walk_ops mlock_walk_ops = {
> > @@ -441,7 +441,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> >        */
> >       if (newflags & VM_LOCKED)
> >               newflags |= VM_IO;
> > -     vma_start_write(vma);
> > +     if (vma_start_write_killable(vma))
> > +             return -EINTR;
> > +
> >       vm_flags_reset_once(vma, newflags);
> >
> >       lru_add_drain();
> > @@ -452,6 +454,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> >               newflags &= ~VM_IO;
> >               vm_flags_reset_once(vma, newflags);
> >       }
> > +     return 0;
> >  }
> >
> >  /*
> > @@ -501,10 +504,12 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >        */
> >       if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
> >               /* No work to do, and mlocking twice would be wrong */
>
> I'd move this comment down to the vm_flags_reset() line as it's not applicable
> if we fail to get the lock.

Ack.

>
>
> > -             vma_start_write(vma);
> > +             ret = vma_start_write_killable(vma);
> > +             if (ret)
> > +                     goto out;
> >               vm_flags_reset(vma, newflags);
> >       } else {
> > -             mlock_vma_pages_range(vma, start, end, newflags);
> > +             ret = mlock_vma_pages_range(vma, start, end, newflags);
> >       }
> >  out:
> >       *prev = vma;
> > @@ -733,9 +738,13 @@ static int apply_mlockall_flags(int flags)
> >
> >               error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
> >                                   newflags);
> > -             /* Ignore errors, but prev needs fixing up. */
> > -             if (error)
> > +             /* Ignore errors except EINTR, but prev needs fixing up. */
>
> Well, except you're not fixing it up on -EINTR? This comment should be redone.

Hmm, should we fix up prev if the process is terminating? Does it matter
at this point?

>
> But I wonder if this is correct? We are ignoring all other errors that
> interrupted the operation, so why are we special casing -EINTR?

Well, -EINTR means all the work you are doing here is useless because
the process is about to go away. So, in that respect I think it's
different from other errors.
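
To make the distinction concrete, here is a minimal userspace sketch. It
assumes hypothetical names (mock_fixup(), apply_all()) and is not the
mm/mlock.c code: ordinary per-item errors are ignored and the walk
continues, while -EINTR stops the walk immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for mlock_fixup(): returns a scripted result. */
static int mock_fixup(const int *results, size_t i)
{
	return results[i];
}

/*
 * Hypothetical analogue of the apply_mlockall_flags() loop: other errors
 * are ignored, but -EINTR aborts at once because a fatal signal is
 * pending and any further work would be wasted.
 * *visited reports how many items were processed.
 */
int apply_all(const int *results, size_t n, size_t *visited)
{
	size_t i;

	assert(results != NULL && visited != NULL);
	*visited = 0;
	for (i = 0; i < n; i++) {
		int error = mock_fixup(results, i);

		(*visited)++;
		if (error == -EINTR)
			return error;
		/* Any other error is deliberately ignored here. */
	}
	return 0;
}
```

With results {0, -ENOMEM, 0} all three items are visited and 0 is
returned; with {0, -EINTR, 0} the walk stops after the second item.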

>
> > +             if (error) {
> > +                     if (error == -EINTR)
> > +                             return error;
> > +
> >                       prev = vma;
> > +             }
> >               cond_resched();
> >       }
> >  out:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index c0571445bef7..49dbb7156936 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -765,7 +765,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> >        * vm_flags and vm_page_prot are protected by the mmap_lock
> >        * held in write mode.
> >        */
> > -     vma_start_write(vma);
> > +     error = vma_start_write_killable(vma);
> > +     if (error < 0)
>
> Weird inconsistency here, this should be if (error).

Ack.

>
> > +             goto fail;
> >       vm_flags_reset_once(vma, newflags);
> >       if (vma_wants_manual_pte_write_upgrade(vma))
> >               mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 2be876a70cc0..aef1e5f373c7 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1286,7 +1286,9 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> >               return -ENOMEM;
> >
> >       /* We don't want racing faults. */
> > -     vma_start_write(vrm->vma);
> > +     err = vma_start_write_killable(vrm->vma);
> > +     if (err)
> > +             return err;
> >
> >       /* Perform copy step. */
> >       err = copy_vma_and_data(vrm, &new_vma);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index bb4d0326fecb..9f2664f1d078 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -530,6 +530,13 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (err)
> >               goto out_free_vmi;
> >
> > +     err = vma_start_write_killable(vma);
> > +     if (err)
> > +             goto out_free_mpol;
> > +     err = vma_start_write_killable(new);
> > +     if (err)
> > +             goto out_free_mpol;
> > +
> >       err = anon_vma_clone(new, vma, VMA_OP_SPLIT);
> >       if (err)
> >               goto out_free_mpol;
> > @@ -540,9 +547,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (new->vm_ops && new->vm_ops->open)
> >               new->vm_ops->open(new);
> >
> > -     vma_start_write(vma);
> > -     vma_start_write(new);
> > -
>
> Again you're changing ordering seemingly arbitrarily.

I moved it up to avoid undoing all the work above (vm_ops->open,
vma_dup_policy, vma_iter_prealloc, vm_area_dup, ...)

> I think this is actually a
> more problematic case as you're now invoking vm_ops->open() with a VMA write
> lock held.

Are you concerned about potential increased duration of the vma lock
being held or that open() might try to take that lock itself (which is
not a problem because vma write locks are reentrant)? Maybe some other
concern I'm missing?

>
> So I think you should keep the existing position.

If we do that then we would have to undo a bunch of operations. I'm
fine with adding that if there are reasons to avoid this move.

>
> >       init_vma_prep(&vp, vma);
> >       vp.insert = new;
> >       vma_prepare(&vp);
> > @@ -895,16 +899,22 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> >       }
> >
> >       /* No matter what happens, we will be adjusting middle. */
> > -     vma_start_write(middle);
> > +     err = vma_start_write_killable(middle);
> > +     if (err)
> > +             goto abort;
> >
> >       if (merge_right) {
> > -             vma_start_write(next);
> > +             err = vma_start_write_killable(next);
> > +             if (err)
> > +                     goto abort;
> >               vmg->target = next;
> >               sticky_flags |= (next->vm_flags & VM_STICKY);
> >       }
> >
> >       if (merge_left) {
> > -             vma_start_write(prev);
> > +             err = vma_start_write_killable(prev);
> > +             if (err)
> > +                     goto abort;
> >               vmg->target = prev;
> >               sticky_flags |= (prev->vm_flags & VM_STICKY);
> >       }
> > @@ -1155,10 +1165,12 @@ int vma_expand(struct vma_merge_struct *vmg)
> >       struct vm_area_struct *next = vmg->next;
> >       bool remove_next = false;
> >       vm_flags_t sticky_flags;
> > -     int ret = 0;
> > +     int ret;
> >
> >       mmap_assert_write_locked(vmg->mm);
> > -     vma_start_write(target);
> > +     ret = vma_start_write_killable(target);
> > +     if (ret)
> > +             return ret;
> >
> >       if (next && target != next && vmg->end == next->vm_end)
> >               remove_next = true;
> > @@ -1187,6 +1199,9 @@ int vma_expand(struct vma_merge_struct *vmg)
> >        * we don't need to account for vmg->give_up_on_mm here.
> >        */
> >       if (remove_next) {
> > +             ret = vma_start_write_killable(next);
> > +             if (ret)
> > +                     return ret;
> >               ret = dup_anon_vma(target, next, &anon_dup);
> >               if (ret)
> >                       return ret;
> > @@ -1197,10 +1212,8 @@ int vma_expand(struct vma_merge_struct *vmg)
> >                       return ret;
> >       }
> >
> > -     if (remove_next) {
> > -             vma_start_write(next);
> > +     if (remove_next)
> >               vmg->__remove_next = true;
> > -     }
>
> Hmm you're moving the ordering of things around again :) You should have made
> this change as part of patch 1 anyway first so this patch wouldn't have a
> refactoring in it too.
>
> Top of rmap.c suggests you _can_ take the VMA write lock prior to trying the dup
> but I'm just not sure why you'd want to switch these around in this patch?
>
> Can we try to keep original ordering unless there's a really good reason not to?

Again, I was trying to avoid undoing things if locking fails, but this
function already has rollback code, so I can reuse it. I'll keep the
old placement here.

>
> >       if (commit_merge(vmg))
> >               goto nomem;
> >
> > @@ -1233,6 +1246,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >              unsigned long start, unsigned long end, pgoff_t pgoff)
> >  {
> >       struct vma_prepare vp;
> > +     int err;
> >
> >       WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
> >
> > @@ -1244,7 +1258,11 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (vma_iter_prealloc(vmi, NULL))
> >               return -ENOMEM;
> >
> > -     vma_start_write(vma);
> > +     err = vma_start_write_killable(vma);
> > +     if (err) {
> > +             vma_iter_free(vmi);
> > +             return err;
> > +     }
> >
> >       init_vma_prep(&vp, vma);
> >       vma_prepare(&vp);
> > @@ -1434,7 +1452,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> >                       if (error)
> >                               goto end_split_failed;
> >               }
> > -             vma_start_write(next);
> > +             error = vma_start_write_killable(next);
> > +             if (error)
> > +                     goto munmap_gather_failed;
> >               mas_set(mas_detach, vms->vma_count++);
> >               error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> >               if (error)
> > @@ -1828,12 +1848,17 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> >  static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> >  {
> >       VMA_ITERATOR(vmi, mm, 0);
> > +     int err;
> >
> >       vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
> >       if (vma_iter_prealloc(&vmi, vma))
> >               return -ENOMEM;
> >
> > -     vma_start_write(vma);
> > +     err = vma_start_write_killable(vma);
> > +     if (err) {
> > +             vma_iter_free(&vmi);
> > +             return err;
> > +     }
> >       vma_iter_store_new(&vmi, vma);
> >       vma_link_file(vma, /* hold_rmap_lock= */false);
> >       mm->map_count++;
> > @@ -2215,9 +2240,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> >        * is reached.
> >        */
> >       for_each_vma(vmi, vma) {
> > -             if (signal_pending(current))
> > +             if (signal_pending(current) || vma_start_write_killable(vma))
> >                       goto out_unlock;
> > -             vma_start_write(vma);
> >       }
> >
> >       vma_iter_init(&vmi, mm, 0);
> > @@ -2522,6 +2546,11 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >       if (!vma)
> >               return -ENOMEM;
> >
> > +     /* Lock the VMA since it is modified after insertion into VMA tree */
> > +     error = vma_start_write_killable(vma);
> > +     if (error)
> > +             goto free_vma;
> > +
>
> You're doing it again :)
>
> Can we please keep the lock acquisition at the point it is in unless there's a
> really good reason not to.
>
> And if there is a good reason, please do it in another commit prior to the
> massive 'change everything' one so it's more easily reviewable :)

The reason for this one is that I want to avoid undoing
__mmap_new_file_vma() if we fail to lock the VMA later. Undoing that
one would be messy, so I would prefer locking it earlier. These
operations are already performed under the mmap write lock. Is it
really a problem if we write-lock the VMA as well?

>
> >       vma_iter_config(vmi, map->addr, map->end);
> >       vma_set_range(vma, map->addr, map->end, map->pgoff);
> >       vm_flags_init(vma, map->vm_flags);
> > @@ -2552,8 +2581,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >       WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
> >  #endif
> >
> > -     /* Lock the VMA since it is modified after insertion into VMA tree */
> > -     vma_start_write(vma);
> >       vma_iter_store_new(vmi, vma);
> >       map->mm->map_count++;
> >       vma_link_file(vma, map->hold_file_rmap_lock);
> > @@ -2864,6 +2891,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >                unsigned long addr, unsigned long len, vm_flags_t vm_flags)
> >  {
> >       struct mm_struct *mm = current->mm;
> > +     int err = -ENOMEM;
>
> I hate this 'default error code' pattern, it's a code smell. Please update
> everything that jumps to the failure case to set err, and leave this
> uninitialised.
>
> We've had real bugs come out of this before!

Ack.

>
> >
> >       /*
> >        * Check against address space limits by the changed size
> > @@ -2908,7 +2936,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
> >       vm_flags_init(vma, vm_flags);
> >       vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > -     vma_start_write(vma);
> > +     if (vma_start_write_killable(vma)) {
> > +             err = -EINTR;
> > +             goto mas_store_fail;
> > +     }
> >       if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> >               goto mas_store_fail;
> >
> > @@ -2928,7 +2959,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       vm_area_free(vma);
> >  unacct_fail:
> >       vm_unacct_memory(len >> PAGE_SHIFT);
> > -     return -ENOMEM;
> > +     return err;
> >  }
> >
> >  /**
> > @@ -3089,7 +3120,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >       struct mm_struct *mm = vma->vm_mm;
> >       struct vm_area_struct *next;
> >       unsigned long gap_addr;
> > -     int error = 0;
> > +     int error;
> >       VMA_ITERATOR(vmi, mm, vma->vm_start);
> >
> >       if (!(vma->vm_flags & VM_GROWSUP))
> > @@ -3126,12 +3157,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >
> >       /* We must make sure the anon_vma is allocated. */
> >       if (unlikely(anon_vma_prepare(vma))) {
> > -             vma_iter_free(&vmi);
> > -             return -ENOMEM;
> > +             error = -ENOMEM;
> > +             goto free;
> >       }
> >
> >       /* Lock the VMA before expanding to prevent concurrent page faults */
> > -     vma_start_write(vma);
> > +     error = vma_start_write_killable(vma);
> > +     if (error)
> > +             goto free;
> >       /* We update the anon VMA tree. */
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -3160,6 +3193,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >               }
> >       }
> >       anon_vma_unlock_write(vma->anon_vma);
> > +free:
>
> Nitty, but this kinda sucks as a label name, generally when the error label
> contains 'free' it is free_xxx where 'xxx' is some specific thing.
>
> So something like 'fail' would be good.

Ack. Will change to something more appropriate.

>
> >       vma_iter_free(&vmi);
> >       validate_mm(mm);
> >       return error;
> > @@ -3174,7 +3208,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >  {
> >       struct mm_struct *mm = vma->vm_mm;
> >       struct vm_area_struct *prev;
> > -     int error = 0;
> > +     int error;
> >       VMA_ITERATOR(vmi, mm, vma->vm_start);
> >
> >       if (!(vma->vm_flags & VM_GROWSDOWN))
> > @@ -3205,12 +3239,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >
> >       /* We must make sure the anon_vma is allocated. */
> >       if (unlikely(anon_vma_prepare(vma))) {
> > -             vma_iter_free(&vmi);
> > -             return -ENOMEM;
> > +             error = -ENOMEM;
> > +             goto free;
> >       }
> >
> >       /* Lock the VMA before expanding to prevent concurrent page faults */
> > -     vma_start_write(vma);
> > +     error = vma_start_write_killable(vma);
> > +     if (error)
> > +             goto free;
> >       /* We update the anon VMA tree. */
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -3240,6 +3276,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >               }
> >       }
> >       anon_vma_unlock_write(vma->anon_vma);
> > +free:
>
> Obviously same comment here :)

Ack.

>
> >       vma_iter_free(&vmi);
> >       validate_mm(mm);
> >       return error;
> > diff --git a/mm/vma_exec.c b/mm/vma_exec.c
> > index 8134e1afca68..a4addc2a8480 100644
> > --- a/mm/vma_exec.c
> > +++ b/mm/vma_exec.c
> > @@ -40,6 +40,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> >       struct vm_area_struct *next;
> >       struct mmu_gather tlb;
> >       PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
> > +     int err;
> >
> >       BUG_ON(new_start > new_end);
> >
> > @@ -55,8 +56,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> >        * cover the whole range: [new_start, old_end)
> >        */
> >       vmg.target = vma;
> > -     if (vma_expand(&vmg))
> > -             return -ENOMEM;
> > +     err = vma_expand(&vmg);
> > +     if (err)
> > +             return err;
>
> Hmm. But before we were filtering the errors and now we're not... I guess not an
> issue as before it could _only_ return -ENOMEM, but again, are we sure all
> callers are fine with -EINTR I guess :)

This function is called only from setup_arg_pages() and all its
callers end up being linux_binfmt.load_binary handlers. The returned
error propagates all the way to execve() and its friends. And with my
extreme "luck" the execve syscall lists probably every single possible
error code except EINTR :) This is depressing...

Thanks for the detailed review, Lorenzo! I guess we need to discuss
these lock moves a bit more before I start on the new version.

>
> >
> >       /*
> >        * move the page tables downwards, on failure we rely on
> > --
> > 2.53.0.414.gf7e9f6c205-goog
> >
>
> Cheers, Lorenzo
>

Thread overview (19+ messages):
2026-02-26  7:06 [PATCH v3 0/3] Use killable vma write locking in most places Suren Baghdasaryan
2026-02-26  7:06 ` [PATCH v3 1/3] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
2026-02-26 16:42   ` Liam R. Howlett
2026-02-26 17:23     ` Suren Baghdasaryan
2026-03-02 13:57   ` Lorenzo Stoakes
2026-03-03 21:08     ` Suren Baghdasaryan
2026-02-26  7:06 ` [PATCH v3 2/3] mm: replace vma_start_write() with vma_start_write_killable() Suren Baghdasaryan
2026-02-26 17:43   ` Liam R. Howlett
2026-02-26 21:44     ` Suren Baghdasaryan
2026-03-02 14:52   ` Lorenzo Stoakes
2026-03-03 22:11     ` Suren Baghdasaryan [this message]
2026-03-03 22:18       ` Matthew Wilcox
2026-03-04  0:02         ` Suren Baghdasaryan
2026-02-26  7:06 ` [PATCH v3 3/3] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
2026-02-26 18:10   ` Claudio Imbrenda
2026-02-26 18:24     ` Suren Baghdasaryan
2026-02-27  8:57       ` Claudio Imbrenda
2026-03-02 15:19   ` Lorenzo Stoakes
2026-03-03 23:59     ` Suren Baghdasaryan
