From: Barry Song <21cnbao@gmail.com>
Date: Wed, 4 Sep 2024 09:10:20 +1200
Subject: Re: [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration
To: Carlos Llamas
Cc: Hillf Danton, Greg Kroah-Hartman, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Suren Baghdasaryan, Michal Hocko, Tangquan Zheng
References: <20240902225009.34576-1-21cnbao@gmail.com> <20240903110109.1696-1-hdanton@sina.com>

On Wed, Sep 4, 2024 at 1:55 AM Carlos Llamas wrote:
>
> On Tue, Sep 03, 2024 at 07:45:12PM +0800, Barry Song wrote:
> > On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton wrote:
> > >
> > > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > > From: Barry Song
> > > >
> > > > The mmap_write_lock() can block all access to the VMAs, for example
> > > > page faults. Performing memory allocation while holding this lock may
> > > > trigger direct reclamation, leaving others queued on the rwsem for an
> > > > extended period.
> > > > We've observed that the allocation can sometimes take more than 300ms,
> > > > significantly blocking other threads. The user interface sometimes
> > > > becomes less responsive as a result. To prevent this, let's move the
> > > > allocation outside of the write lock.
>
> Thanks for your patch, Barry. So, we are aware of this contention and I've
> been working on a fix for it. See more about this below.

Cool, Carlos!

> > >
> > > I suspect concurrent allocators make things better wrt response, cutting
> > > alloc latency down to 10ms for instance in your scenario. Feel free to
> > > show figures given Tangquan's 48-hour profiling.
> >
> > Likely.
> >
> > Concurrent allocators are quite common in PFs that race on the same
> > PTE: whoever gets the PTL sets the PTE, and the others free their
> > allocated pages.
> >
> > > > A potential side effect could be an extra alloc_page() for the second
> > > > thread executing binder_install_single_page() while the first thread
> > > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > > negligible.
>
> This is not negligible. In fact, it is the exact reason for the page
> allocation to be done with the mmap sem. If the first thread sleeps on
> vm_insert_page(), then binder gets into a bad state of multiple threads
> trying to reclaim pages that won't really be used. Memory pressure goes
> from bad to worse pretty quickly.
>
> FWIW, I believe this was first talked about here:
> https://lore.kernel.org/all/ZWmNpxPXZSxdmDE1@google.com/

However, I'm not entirely convinced that this is a problem :-) Concurrent
allocations like this can occur in many places, especially in PFs.
Reclamation is not wasted work either; the memory it frees up can be used
by others. I also don't believe binder is one of the largest users
executing concurrent allocations.
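
To make the pattern concrete: here is a deliberately simplified sketch of
the anonymous-fault shape I mean. It is not the real do_anonymous_page()
(which also handles accounting, rmap, and the LRU); the function name and
the stripped-down error handling are mine:

static vm_fault_t demo_anon_fault(struct vm_area_struct *vma, pmd_t *pmd,
                                  unsigned long addr)
{
        struct mm_struct *mm = vma->vm_mm;
        struct page *page;
        spinlock_t *ptl;
        pte_t *pte;

        /* Allocate (and possibly enter direct reclaim) with no lock held. */
        page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
        if (!page)
                return VM_FAULT_OOM;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte || !pte_none(ptep_get(pte))) {
                /* Lost the race: another thread has already set the PTE. */
                if (pte)
                        pte_unmap_unlock(pte, ptl);
                __free_page(page);
                return 0;
        }
        /* Won the race: publish the page while holding the PTL. */
        set_pte_at(mm, addr, pte, mk_pte(page, vma->vm_page_prot));
        pte_unmap_unlock(pte, ptl);
        return 0;
}

The loser's extra alloc_page() here is exactly the "side effect" being
discussed above, and the fault path has always tolerated it.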

> > > > On the other hand, holding a write lock without making any VMA
> > > > modifications appears questionable and likely incorrect. While this
> > > > patch focuses on reducing the lock duration, future updates may aim
> > > > to eliminate the write lock entirely.
> > >
> > > If spin, better not before taking a look at vm_insert_page().
> >
> > I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
> > currently in the testing queue. At the moment, alloc->spin is in place,
> > but I'm not entirely convinced it's the best replacement for the write
> > lock. Let's wait for Tangquan's test results.
> >
> > Patch 2 is detailed below, but it has only passed the build-test phase
> > so far, so its result is uncertain. I'm sharing it early in case you
> > find it interesting. And I am not convinced that commit d1d8875c8c13
> > ("binder: fix UAF of alloc->vma in race with munmap()") is a correct
> > fix that really avoids all UAFs of alloc->vma.
> >
> > [PATCH] binder_alloc: Don't use mmap_write_lock for installing page
> >
> > Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
> > munmap()") uses the mmap_rwsem write lock to protect against a race
> > condition with munmap, where the vma is detached under the write lock
> > but pages are zapped under the read lock. This approach is extremely
> > expensive for the system, though perhaps less so for binder itself,
> > as the write lock can block all other operations.
> >
> > As an alternative, we could hold only the read lock and re-check
> > that the vma hasn't been detached. To protect against simultaneous
> > page installation, we could use alloc->lock instead.
> >
> > Signed-off-by: Barry Song
> > ---
> >  drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > index f20074e23a7c..a2281dfacbbc 100644
> > --- a/drivers/android/binder_alloc.c
> > +++ b/drivers/android/binder_alloc.c
> > @@ -228,24 +228,17 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >                 return -ESRCH;
> >
> >         /*
> > -        * Don't allocate page in mmap_write_lock, this can block
> > -        * mmap_rwsem for a long time; Meanwhile, allocation failure
> > -        * doesn't necessarily need to return -ENOMEM, if lru_page
> > -        * has been installed, we can still return 0 (success).
> > +        * Allocation failure doesn't necessarily need to return -ENOMEM,
> > +        * if lru_page has been installed, we can still return 0 (success).
> > +        * So, defer the !page check until after binder_get_installed_page()
> > +        * is completed.
> >          */
> >         page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >
> > -       /*
> > -        * Protected with mmap_sem in write mode as multiple tasks
> > -        * might race to install the same page.
> > -        */
> > -       mmap_write_lock(alloc->mm);
> > -       if (binder_get_installed_page(lru_page)) {
> > -               ret = 1;
> > -               goto out;
> > -       }
> > +       mmap_read_lock(alloc->mm);
> >
> > -       if (!alloc->vma) {
> > +       /* vma might have been dropped or detached */
> > +       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
> >                 pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> >                 ret = -ESRCH;
> >                 goto out;
> > @@ -257,18 +250,27 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >                 goto out;
> >         }
> >
> > +       spin_lock(&alloc->lock);
>
> You can't hold a spinlock and then call vm_insert_page().

Thanks! This patch has only passed the build test so far. It seems like
we can hold off on further testing for now.
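
Thinking ahead to a possible v2: vm_insert_page() can sleep (it may need
to allocate page tables with GFP_KERNEL), so any lock held across it must
be sleepable. Alternatively, the PTE itself could arbitrate the race,
since vm_insert_page() returns -EBUSY when a page is already mapped at
that address. A rough, untested sketch of that idea; demo_install_page()
is just an illustrative name, and the binder_lru_page bookkeeping would
still need its own synchronization:

/*
 * Untested sketch: let the PTE arbitrate the install race so that no
 * sleepable call runs under a spinlock. vm_insert_page() returns
 * -EBUSY when a page is already mapped at @addr, i.e. another thread
 * won the race.
 */
static int demo_install_page(struct vm_area_struct *vma,
                             unsigned long addr, struct page *page)
{
        int ret;

        ret = vm_insert_page(vma, addr, page);
        if (ret == -EBUSY) {
                /* Lost the race; drop our now-redundant page. */
                __free_page(page);
                return 0;
        }
        if (ret)
                __free_page(page);
        return ret;
}

The loser's cost would then be only the extra alloc_page() already
discussed above.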

> > +       if (binder_get_installed_page(lru_page)) {
> > +               spin_unlock(&alloc->lock);
> > +               ret = 1;
> > +               goto out;
> > +       }
> > +
> >         ret = vm_insert_page(alloc->vma, addr, page);
> >         if (ret) {
> >                 pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> >                        alloc->pid, __func__, addr - alloc->buffer, ret);
> > +               spin_unlock(&alloc->lock);
> >                 ret = -ENOMEM;
> >                 goto out;
> >         }
> >
> >         /* Mark page installation complete and safe to use */
> >         binder_set_installed_page(lru_page, page);
> > +       spin_unlock(&alloc->lock);
> >  out:
> > -       mmap_write_unlock(alloc->mm);
> > +       mmap_read_unlock(alloc->mm);
> >         mmput_async(alloc->mm);
> >         if (ret && page)
> >                 __free_page(page);
> > --
> > 2.39.3 (Apple Git-146)
>
> Sorry, but as I mentioned, I've been working on fixing this contention
> by supporting concurrent "faults" in binder_install_single_page(). This
> is the appropriate fix. I should be sending a patch soon after working
> out the conflicts with the shrinker's callback.

Awesome! I'm eager to see your patch, and we're ready to help with testing.
I strongly recommend dropping the write lock entirely. Using
mmap_write_lock() isn't just a binder-specific concern; it has the
potential to affect the entire Android system.

In patch 3, I experimented with using per_vma_lock as well. I'm _not_
proposing it for merging since you're already working on it, but I wanted
to share the idea. (Just like patch 2, it has only passed the build test.)

[PATCH] binder_alloc: Further move to per_vma_lock from mmap_read_lock

To further reduce the read lock duration, let's try using per_vma_lock
first. If that fails, we can take the read lock, similar to how page
fault handlers operate.

Signed-off-by: Barry Song
---
 drivers/android/binder_alloc.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index a2281dfacbbc..b40a5dd650c8 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -221,6 +221,8 @@ static int binder_install_single_page(struct binder_alloc *alloc,
                                       struct binder_lru_page *lru_page,
                                       unsigned long addr)
 {
+       struct vm_area_struct *vma;
+       bool per_vma_lock = true;
        struct page *page;
        int ret = 0;

@@ -235,10 +237,15 @@ static int binder_install_single_page(struct binder_alloc *alloc,
         */
        page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);

-       mmap_read_lock(alloc->mm);
+       vma = lock_vma_under_rcu(alloc->mm, addr);
+       if (!vma) {
+               per_vma_lock = false;
+               mmap_read_lock(alloc->mm);
+               vma = find_vma(alloc->mm, addr);
+       }

-       /* vma might have been dropped or detached */
-       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
+       /* vma might have been dropped, detached, or changed to a new one */
+       if (!alloc->vma || !vma || vma != alloc->vma) {
                pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
                ret = -ESRCH;
                goto out;
@@ -270,7 +277,10 @@ static int binder_install_single_page(struct binder_alloc *alloc,
        binder_set_installed_page(lru_page, page);
        spin_unlock(&alloc->lock);
 out:
-       mmap_read_unlock(alloc->mm);
+       if (per_vma_lock)
+               vma_end_read(vma);
+       else
+               mmap_read_unlock(alloc->mm);
        mmput_async(alloc->mm);
        if (ret && page)
                __free_page(page);
--
2.39.3 (Apple Git-146)

>
> Thanks,
> --
> Carlos Llamas

Thanks
Barry