From: Lokesh Gidra <lokeshgidra@google.com>
Date: Wed, 7 Feb 2024 10:48:35 -0800
Subject: Re: [PATCH v3 3/3] userfaultfd: use per-vma locks in userfaultfd operations
To: "Liam R. Howlett", Lokesh Gidra, akpm@linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, selinux@vger.kernel.org,
	surenb@google.com, kernel-team@android.com, aarcange@redhat.com,
	peterx@redhat.com, david@redhat.com, axelrasmussen@google.com,
	bgeffon@google.com, willy@infradead.org, jannh@google.com,
	kaleshsingh@google.com, ngeoffray@google.com, timmurray@google.com,
	rppt@kernel.org
In-Reply-To: <20240206170501.3caqeylaogpaemuc@revolver>
References: <20240206010919.1109005-1-lokeshgidra@google.com>
	<20240206010919.1109005-4-lokeshgidra@google.com>
	<20240206170501.3caqeylaogpaemuc@revolver>

On Tue, Feb 6, 2024 at 9:05 AM Liam R. Howlett wrote:
>
> * Lokesh Gidra [240205 20:10]:
> > All userfaultfd operations, except write-protect, opportunistically use
> > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > critical section.
> >
> > Write-protect operation requires mmap_lock as it iterates over multiple
> > vmas.
> >
> > Signed-off-by: Lokesh Gidra
> > ---
> >  fs/userfaultfd.c              |  13 +-
> >  include/linux/mm.h            |  16 +++
> >  include/linux/userfaultfd_k.h |   5 +-
> >  mm/memory.c                   |  48 +++++++
> >  mm/userfaultfd.c              | 242 +++++++++++++++++++++-------------
> >  5 files changed, 222 insertions(+), 102 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index c00a021bcce4..60dcfafdc11a 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -2005,17 +2005,8 @@ static int userfaultfd_move(struct userfaultfd_ctx *ctx,
> >                 return -EINVAL;
> >
> >         if (mmget_not_zero(mm)) {
> > -               mmap_read_lock(mm);
> > -
> > -               /* Re-check after taking map_changing_lock */
> > -               down_read(&ctx->map_changing_lock);
> > -               if (likely(!atomic_read(&ctx->mmap_changing)))
> > -                       ret = move_pages(ctx, mm, uffdio_move.dst, uffdio_move.src,
> > -                                        uffdio_move.len, uffdio_move.mode);
> > -               else
> > -                       ret = -EAGAIN;
> > -               up_read(&ctx->map_changing_lock);
> > -               mmap_read_unlock(mm);
> > +               ret = move_pages(ctx, uffdio_move.dst, uffdio_move.src,
> > +                                uffdio_move.len, uffdio_move.mode);
> >                 mmput(mm);
> >         } else {
> >                 return -ESRCH;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0d1f98ab0c72..e69dfe2edcce 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -753,6 +753,11 @@ static inline void release_fault_lock(struct vm_fault *vmf)
> >                 mmap_read_unlock(vmf->vma->vm_mm);
> >  }
> >
> > +static inline void unlock_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> > +{
> > +       vma_end_read(vma);
> > +}
> > +
> >  static inline void assert_fault_locked(struct vm_fault *vmf)
> >  {
> >         if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > @@ -774,6 +779,9 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >         { mmap_assert_write_locked(vma->vm_mm); }
> >  static inline void vma_mark_detached(struct vm_area_struct *vma,
> >                                      bool detached) {}
> > +static inline void vma_acquire_read_lock(struct vm_area_struct *vma) {
> > +       mmap_assert_locked(vma->vm_mm);
> > +}
> >
> >  static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >                 unsigned long address)
> > @@ -786,6 +794,11 @@ static inline void release_fault_lock(struct vm_fault *vmf)
> >         mmap_read_unlock(vmf->vma->vm_mm);
> >  }
> >
> > +static inline void unlock_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> > +{
> > +       mmap_read_unlock(mm);
> > +}
> > +
>
> Instead of passing two variables and only using one based on the kernel
> build configuration, why not use vma->vm_mm in mmap_read_unlock() and
> just pass the vma?
>
> It is odd to call unlock_vma() which maps to mmap_read_unlock().  Could
> we have this abstraction depend on CONFIG_PER_VMA_LOCK in uffd so that
> reading the code remains clear?  You seem to have pretty much two
> versions of each function already.  If you do that, then we can leave
> unlock_vma() undefined if !CONFIG_PER_VMA_LOCK.
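For the unlock side, to confirm I'm reading you right: something like
the following, kept local to uffd (untested sketch; uffd_unlock_vma()
is a placeholder name)?

	#ifdef CONFIG_PER_VMA_LOCK
	/* Only the vma read lock is held on this path. */
	static void uffd_unlock_vma(struct vm_area_struct *vma)
	{
		vma_end_read(vma);
	}
	#else
	/* Without per-vma locks, the read lock is on the whole mm. */
	static void uffd_unlock_vma(struct vm_area_struct *vma)
	{
		mmap_read_unlock(vma->vm_mm);
	}
	#endif

Using vma->vm_mm inside also means only the vma needs to be passed, as
you suggest. Or, per your last point, the !CONFIG_PER_VMA_LOCK variant
can be left out entirely and those paths can call mmap_read_unlock()
directly.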
> >  static inline void assert_fault_locked(struct vm_fault *vmf)
> >  {
> >         mmap_assert_locked(vmf->vma->vm_mm);
> > @@ -794,6 +807,9 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> >  #endif /* CONFIG_PER_VMA_LOCK */
> >
> >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> > +extern struct vm_area_struct *lock_vma(struct mm_struct *mm,
> > +                                       unsigned long address,
> > +                                       bool prepare_anon);
> >
> >  /*
> >   * WARNING: vma_init does not initialize vma->vm_lock.
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 3210c3552976..05d59f74fc88 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -138,9 +138,8 @@ extern long uffd_wp_range(struct vm_area_struct *vma,
> >  /* move_pages */
> >  void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
> >  void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
> > -ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
> > -                  unsigned long dst_start, unsigned long src_start,
> > -                  unsigned long len, __u64 flags);
> > +ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > +                  unsigned long src_start, unsigned long len, __u64 flags);
> >  int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dst_pmdval,
> >                         struct vm_area_struct *dst_vma,
> >                         struct vm_area_struct *src_vma,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b05fd28dbce1..393ab3b0d6f3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5760,8 +5760,56 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >         count_vm_vma_lock_event(VMA_LOCK_ABORT);
> >         return NULL;
> >  }
> > +
> > +static void vma_acquire_read_lock(struct vm_area_struct *vma)
> > +{
> > +       /*
> > +        * We cannot use vma_start_read() as it may fail due to false locked
> > +        * (see comment in vma_start_read()). We can avoid that by directly
> > +        * locking vm_lock under mmap_lock, which guarantees that nobody could
> > +        * have locked the vma for write (vma_start_write()).
> > +        */
> > +       mmap_assert_locked(vma->vm_mm);
> > +       down_read(&vma->vm_lock->lock);
> > +}
> >  #endif /* CONFIG_PER_VMA_LOCK */
> >
> > +/*
> > + * lock_vma() - Lookup and lock VMA corresponding to @address.
>
> Missing arguments in the comment
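Ack, I'll complete the kernel-doc in the next version, roughly:

	 * @mm: mm_struct in which to search for and lock the VMA.
	 * @address: address that the locked VMA must contain.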
> > + * @prepare_anon: If true, then prepare the VMA (if anonymous) with anon_vma.
> > + *
> > + * Should be called without holding mmap_lock. VMA should be unlocked after use
> > + * with unlock_vma().
> > + *
> > + * Return: A locked VMA containing @address, NULL if no VMA is found, or
> > + * -ENOMEM if anon_vma couldn't be allocated.
> > + */
> > +struct vm_area_struct *lock_vma(struct mm_struct *mm,
> > +                               unsigned long address,
> > +                               bool prepare_anon)
> > +{
> > +       struct vm_area_struct *vma;
> > +
> > +       vma = lock_vma_under_rcu(mm, address);
> > +
>
> Nit: extra newline
>
> > +       if (vma)
> > +               return vma;
> > +
> > +       mmap_read_lock(mm);
> > +       vma = vma_lookup(mm, address);
> > +       if (vma) {
> > +               if (prepare_anon && vma_is_anonymous(vma) &&
> > +                   anon_vma_prepare(vma))
> > +                       vma = ERR_PTR(-ENOMEM);
> > +               else
> > +                       vma_acquire_read_lock(vma);
> > +       }
> > +
> > +       if (IS_ENABLED(CONFIG_PER_VMA_LOCK) || !vma || PTR_ERR(vma) == -ENOMEM)
> > +               mmap_read_unlock(mm);
> > +       return vma;
> > +}
> > +
>
> It is also very odd that lock_vma() may, in fact, be locking the mm.  It
> seems like there is a layer of abstraction missing here, where your code
> would either lock the vma or lock the mm - like you had before, but
> without the confusing semantics of unlocking with a flag.  That is, we
> know what to do to unlock based on CONFIG_PER_VMA_LOCK, but it isn't
> always used.
>
> Maybe my comments were not clear on what I was thinking for the locking
> plan.  I was thinking that, in the CONFIG_PER_VMA_LOCK case, you could
> have a lock_vma() which does the per-vma locking, and which you can use
> in your code.  You could call lock_vma() in some uffd helper function
> that would do what is required (limit checking, etc) and return a locked
> vma.
>
> The counterpart of that would be another helper function that would do
> what is required under the mmap_read lock (limit check, etc).  The
> unlocking would be entirely config dependent, as you have today.
>
> Just write the few functions you have twice: once for per-vma lock
> support, once without it.  Since we can now ensure the per-vma lock is
> taken in the per-vma lock path (or it failed), you don't need the
> mmap_locked boolean you had in the previous version.  You solved the
> unlock issue already, but it should be abstracted so that uffd calls the
> underlying unlock, vs vma_unlock() doing an mmap_read_unlock() - because
> that's very confusing to see.
>
> I'd also drop the vma from the names of the functions that lock the mm
> or the vma.
>
> Thanks,
> Liam

Got it now; I'll make the changes in the next version.

Would it be ok to define lock_vma()/unlock_vma() (in the
CONFIG_PER_VMA_LOCK case) in mm/userfaultfd.c as well? I suggest this
because, first, there are no other users of these functions, and second,
because of what Jann pointed out about anon_vma: lock_vma_under_rcu()
(rightly) only checks for the private+anonymous case, not the
private+file-backed one. So the lock_vma() implementation is getting
very userfaultfd-specific IMO. A rough sketch of what I have in mind
follows.
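For CONFIG_PER_VMA_LOCK, something along these lines (untested;
uffd_lock_vma() is a placeholder name, and I've dropped the prepare_anon
flag just to keep the sketch short):

	static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
						    unsigned long address)
	{
		struct vm_area_struct *vma;

		vma = lock_vma_under_rcu(mm, address);
		if (vma)
			return vma;

		/*
		 * Fall back to mmap_lock and take the vma read lock under
		 * it: vma_start_read() can fail spuriously, but down_read()
		 * of vm_lock under mmap_lock cannot race with
		 * vma_start_write().
		 */
		mmap_read_lock(mm);
		vma = vma_lookup(mm, address);
		if (vma) {
			/*
			 * lock_vma_under_rcu() bails out on anonymous vmas
			 * with no anon_vma yet, so prepare it here under
			 * mmap_lock.
			 */
			if (vma_is_anonymous(vma) && anon_vma_prepare(vma))
				vma = ERR_PTR(-ENOMEM);
			else
				down_read(&vma->vm_lock->lock);
		}
		mmap_read_unlock(mm);
		return vma;
	}

The !CONFIG_PER_VMA_LOCK counterpart would just be mmap_read_lock() +
vma_lookup() (+ anon_vma_prepare() where needed), with unlocking
reducing to mmap_read_unlock() as it does today.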