From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 7A3C1C5B542
	for <linux-mm@archiver.kernel.org>; Tue, 27 May 2025 20:40:22 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id D56EB6B0096; Tue, 27 May 2025 16:40:21 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D2EB36B0098; Tue, 27 May 2025 16:40:21 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C43D86B0099; Tue, 27 May 2025 16:40:21 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id A70596B0096
	for <linux-mm@kvack.org>; Tue, 27 May 2025 16:40:21 -0400 (EDT)
Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id 043741616B6
	for <linux-mm@kvack.org>; Tue, 27 May 2025 20:40:20 +0000 (UTC)
X-FDA: 83489855442.02.2CADAE9
Received: from mail-ed1-f41.google.com (mail-ed1-f41.google.com [209.85.208.41])
	by imf15.hostedemail.com (Postfix) with ESMTP id 02D5AA0016
	for <linux-mm@kvack.org>; Tue, 27 May 2025 20:40:18 +0000 (UTC)
Authentication-Results: imf15.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=V6JIIkWA;
	spf=pass (imf15.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.208.41 as permitted sender) smtp.mailfrom=lokeshgidra@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1748378419;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=91mdXzjYrGvaj7Dh9A9yl2QArCWVP/eA86xKiPghOcI=;
	b=iAmhd2OXHutbzU6jKVgarn++SlmZXNy5M39wiqRa/hFh989UXfzudCBfwlilsGejVBiEZg
	ImtIUbffdZPARS91WgPGKXvTsC4CiIEFPmPs7UI0J761G4Cdjz1VN3uFytb70PSTufmmmb
	Qg4eRZ5g8+R1M8urPiXUBndIlwFaLIk=
ARC-Authentication-Results: i=1;
	imf15.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=V6JIIkWA;
	spf=pass (imf15.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.208.41 as permitted sender) smtp.mailfrom=lokeshgidra@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748378419; a=rsa-sha256;
	cv=none;
	b=5mQ6KAfTJCbr/nWCXxjkVYmb1ghVcm2bp75Xx0QTsz7wL7ukFeBgDw5sJ874xdsFj7K5Hd
	uCZcSevyIk2HDoYXqAZWJSTMRJ/5twKqYbfA0YEEsBUt5k1T4IsiJbRVH8xb8aGB2UpVCf
	MVUqjW1ubklH0y6ju/VcdCPEqAiRPE0=
Received: by mail-ed1-f41.google.com with SMTP id 4fb4d7f45d1cf-6024087086dso3169a12.0
        for <linux-mm@kvack.org>; Tue, 27 May 2025 13:40:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1748378417; x=1748983217; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=91mdXzjYrGvaj7Dh9A9yl2QArCWVP/eA86xKiPghOcI=;
        b=V6JIIkWAUMJYDcOoEQG1G68WPXmlVsztnmspZ3lPnK40ygATbim3Dyfx3CxHSn8LAP
         sWdhXs+CSj2DuSldtgWBTAqAqhRC728xf/W03WnnE+CzSA3H1YDRuzJtGr3O6zys0rjj
         0iWanYBF5b8oRwbq9yEasJ1Dk4uiEXVSuy5wPZQ8R87lUUFQ+SEtRPUGka8cM20oBuBk
         ZR7Lt48WAJZ25dSeAX+pTPKPXOQQw6Qtbsne+Ma2lfO9NHRKNusYBZ/xXELvpw6esi7g
         zcd8eU+/xBMF1VMixEU8BPwbZiIkNwYYKljsCbVBKR1GMMT6iE/z2xev0EGPp+mQ2xPQ
         VWeg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1748378417; x=1748983217;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=91mdXzjYrGvaj7Dh9A9yl2QArCWVP/eA86xKiPghOcI=;
        b=FgM3kzs7GswRfzZKEVIExHjZBdBqDWLnD1T4vFkWN4q+vvCNoCK6QudwKcf3cDITc2
         IoRKJ0KM6saYJONuwChat8zokFjtTg3CNIsZVJBLcAqiIzBfgI/ikaodPWQznOI4IvPR
         VvZkvR9GEwqNR4bzD/2wIdhXZhoCLz+DzaDBTs6l1XNC1eGzhd4wfzl0IYWmrVuhX57x
         Qx6rI+o9zR1Wf+MtkJqQ0f5DaqNs5ZAYTAwwKDycOS8riCwFl2B6aMONdmFgs+nDkej9
         DCHBKqiEBZFxYkoQeGt5yT7xn6MNU1SAVu6uA8yzBrINuKBKdv5imkjazhrp6ki/JJbh
         GKZg==
X-Forwarded-Encrypted: i=1; AJvYcCW/VmPZFYw7fddBNIXIoXggogp8/R8/EN6EtbpTKTRa6liJsd/aQKgbK2Y+aDr/BUZJRZzOhz68UQ==@kvack.org
X-Gm-Message-State: AOJu0Yw9f9fBtEYcafoHhn21CwDhajU0raunYBl3+ycdZ+l5DGWkufIX
	KyglQEePNHoo1Il0lEfc0+Oeaz8+/psnQTfXR2/eKoCBuqXWUTja+3L9nZiweqt1cLYh7n9Aogl
	R+oSYq+CKq2z/LDE+wqpNsGnAtg0SMFDNt5g4iGEk
X-Gm-Gg: ASbGncv7TMCcuN1y/6EwJJh0ZeUSE0iilaAEhEGcYxyQSM8h0O2DqpSX/19HcWjEkXr
	dE7adygq6+4KQVjpbPggfXCUYpzG7EZj/CUdtsSPjppOhaJwf3rLKI6qbp3qTs4Rz2eTdDaht6v
	dmX6fjaJpalpeSbnGA9Bv0ehmecef02NuidLzq6dmcsO+Tk2iHB8owgtzLLvWD/kh8pUQo6mbK
X-Google-Smtp-Source: AGHT+IEyQdVjEh01ByN1Gy4wyq+UouMQztZNI3UXZcPd8rFwNBK7jeF6QdKgHLHKDSNEfhZEV1rUngjqYE7RKwuOyXI=
X-Received: by 2002:aa7:cfd4:0:b0:600:9008:4a40 with SMTP id
 4fb4d7f45d1cf-60515940cd3mr27335a12.4.1748378417078; Tue, 27 May 2025
 13:40:17 -0700 (PDT)
MIME-Version: 1.0
References: <20250527044145.13153-1-21cnbao@gmail.com> <93385672-927f-4de5-a158-fc3fc0424be0@lucifer.local>
In-Reply-To: <93385672-927f-4de5-a158-fc3fc0424be0@lucifer.local>
From: Lokesh Gidra <lokeshgidra@google.com>
Date: Tue, 27 May 2025 13:40:05 -0700
X-Gm-Features: AX0GCFvsqylKcHHDepKynOdNf6b-SwzsoF1wjU3y-l44a5LLCSguH4yWv1a1Tvw
Message-ID: <CA+EESO6_RBX=nvrWO46aR7Q7xibh8fM-BX2p7_ihcbYyMfpVYQ@mail.gmail.com>
Subject: Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, 
	linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>, 
	"Liam R. Howlett" <Liam.Howlett@oracle.com>, David Hildenbrand <david@redhat.com>, 
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>, 
	Suren Baghdasaryan <surenb@google.com>, Tangquan Zheng <zhengtangquan@oppo.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Stat-Signature: gr94dmpt6oqi8t6nmrs7gd8jk43cyqw4
X-Rspamd-Queue-Id: 02D5AA0016
X-Rspam-User: 
X-Rspamd-Server: rspam02
X-HE-Tag: 1748378418-291754
X-HE-Meta: U2FsdGVkX1+qUkeU/pzijcexthy0gXskYibwH/BSYTt1JVYOowDkQSIoDMYg+zmQoakMjY1hqSQFWo1hFfFLTtylVgAyU4eXHI4xRBvWFVzBJywT/SWLSqlGS0Ro4n7PILTLm1S+FWUgeSwiM+mIhUzH17t3Rwdoj4sG/0CN7hW/GwK4Jiusuyt8GQ6A1LjiS0LlaFx7KGbq56NHpUXlmS7B+dgoDFCpdcgGTzM/RPy3ZhXjpdbbVN/wMUgHCbHl93AhiF0hNxoqERwQIoWwz8KlkdvncwIhWNYJGa8Q9NGNHL0OHQBSnQQnBVXzJrkdOogEdeXtMbFplGLsNEJDkyZmJ0OG5fU6hY5FGQRuyzLHebN3OC6MAbFT4eu00mQbkF+QI+AFH9o2n9hJd0/nUf0OaCjkt0jHWshtoVlbrvUMyz4jHTL589/7+cKceYjEjQ8GE7XaZIfk7+/PVJIgdCZ8VunRb7eug8n3kkIUjtCL0L0ZXYKM6TN8Dx4g+PJwhvQFN8Rbzy4g6moQmn1x8Jcmvw9U+udjZw3SqPntU25obuttuhPZAhuAqPZfpAeizrXD8BzKi4MFQCB+MJ8mNGjp/fbu5Y63SwRpA7wqBBvc/3UvlwvQVw2HjykLBTBG8wJjZtuofMzONqaC/VVMdTtsxqXYIkCuYOazHqrPeh/xZ9x3UP4zpycwzmRa9PpFb4m1yAqa1CsppW5kfHSgdt4g3Ea6tttvKsyiu6x6PAX04LZ0DN6OVzT7ImP8ziuXzfA4vdbbWCYUd7ijLzfGl1hJDWMaGmWXXOAVrAV1OSABVLSb5ODEfgIyzf9OJpoW4Id/6AfWglKZ5PjoAcJgLHcTAJR/hgxE6FmBg/NO3H7fYwX9bS/CmIk2N/9mmKhsTwX8fw5x4XseLqeqTOuHCSNP1fUC+h0qQNHv7eh33QZNMOh8ztRpX6Q9ZS4T14s8898gJ9pAnHo=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Thanks a lot, Barry. This was overdue.

On Tue, May 27, 2025 at 2:20=E2=80=AFAM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Overall - thanks for this, and I'm not sure why we didn't think of doing
> this sooner :P this seems like a super valid thing to try to use the vma
> lock with.
>
> I see you've cc'd Suren who has the most expertise in this and can
> hopefully audit this and ensure all is good, but from the process address
> doc (see below), I think we're good to just have the VMA stabilised for a
> zap.
>
> On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > frequently than other madvise options, particularly in native and Java
> > heaps for dynamic memory management.

Actually, it would be great if you can also optimize the MADV_FREE
case. There are quite a few performance critical situations where
MADV_FREE is performed and are done within a single VMA.
>
> Ack yeah, I have gathered that this is the case previously.
>
> >
> > Currently, the mmap_lock is always held during these operations, even w=
hen
> > unnecessary. This causes lock contention and can lead to severe priorit=
y
> > inversion, where low-priority threads=E2=80=94such as Android's HeapTas=
kDaemon=E2=80=94
> > hold the lock and block higher-priority threads.
>
> That's very nasty... we definitely want to eliminate as much mmap_lock
> contention as possible.
>
> >
> > This patch enables the use of per-VMA locks when the advised range lies
> > entirely within a single VMA, avoiding the need for full VMA traversal.=
 In
> > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VM=
As.
>
> Yeah this single VMA requirement is obviously absolutely key.
>
> As per my docs [0] actually, for zapping a single VMA, 'The VMA need only=
 be
> kept stable for this operation.' (I had to look this up to remind myself =
:P)
>
> [0]: https://kernel.org/doc/html/latest/mm/process_addrs.html
>
> So we actually... should be good here, locking-wise.
>
> >
> > Tangquan=E2=80=99s testing shows that over 99.5% of memory reclaimed by=
 Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.

I am quite confident that this pattern is not limited to
HeapTaskDaemon. For instance, I'm sure the Scudo allocator must also
be calling MADV_DONTNEED within single VMAs.
>
> Thanks, this sounds really promising!
>
> I take it then you have as a result, heavily tested this change?
>
> >
> > To simplify handling, the implementation falls back to the standard
> > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity=
 of
> > userfaultfd_remove().
>
> Oh GOD do I hate how we implement uffd. Have I ever mentioned that? Well,
> let me mention it again...
>
> >
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 8433ac9b27e0..da016a1d0434 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned lo=
ng start, size_t len_in, int beh
> >
> >       if (madvise_should_skip(start, len_in, behavior, &error))
> >               return error;
> > +
> > +     /*
> > +      * MADV_DONTNEED is commonly used with userspace heaps and most o=
ften
> > +      * affects a single VMA. In these cases, we can use per-VMA locks=
 to
> > +      * reduce contention on the mmap_lock.
> > +      */
> > +     if (behavior =3D=3D MADV_DONTNEED || behavior =3D=3D MADV_DONTNEE=
D_LOCKED) {
>
> So firstly doing this here means process_madvise() doesn't get this benef=
it, and
> we're inconsistent between the two which we really want to avoid.
>
> But secondly - we definitely need to find a better way to do this :) this
> basically follows the 'ignore the existing approach and throw in an if
> (special case) { ... }' pattern that I feel we really need to do all we c=
an
> to avoid in the kernel.
>
> This lies the way of uffd, hugetlb, and thus horrors beyond imagining.
>
> I can see why you did this as this is kind of special-cased a bit, and we
> already do this kind of thing all over the place but let's try to avoid
> this here.
>
> So I suggest:
>
> - Remove any code for this from do_madvise() and thus make it available t=
o
>   process_madvise() also.
>
> - Try to avoid the special casing here as much as humanly possible :)
>
> - Update madvise_lock()/unlock() to get passed a pointer to struct
>   madvise_behavior to which we can add a boolean or even better I think -
>   an enum indicating which lock type was taken (this can simplify
>   madvise_unlock() also).
>
> - Update madvise_lock() to do all of the checks below, we already
>   effectively do a switch (behavior) so it's not so crazy to do this. And
>   you can also do the fallthrough logic there.
>
> - Obviously madvise_unlock() can be updated to do vma_end_read().
>
> > +             struct vm_area_struct *prev, *vma;
> > +             unsigned long untagged_start, end;
> > +
> > +             untagged_start =3D untagged_addr(start);
> > +             end =3D untagged_start + len_in;
> > +             vma =3D lock_vma_under_rcu(mm, untagged_start);
> > +             if (!vma)
> > +                     goto lock;
> > +             if (end > vma->vm_end || userfaultfd_armed(vma)) {
> > +                     vma_end_read(vma);
> > +                     goto lock;
> > +             }
> > +             if (unlikely(!can_modify_vma_madv(vma, behavior))) {
> > +                     error =3D -EPERM;
> > +                     vma_end_read(vma);
> > +                     goto out;
> > +             }
> > +             madvise_init_tlb(&madv_behavior, mm);
> > +             error =3D madvise_dontneed_free(vma, &prev, untagged_star=
t,
> > +                             end, &madv_behavior);
> > +             madvise_finish_tlb(&madv_behavior);
> > +             vma_end_read(vma);
> > +             goto out;
> > +     }
> > +
> > +lock:
> >       error =3D madvise_lock(mm, behavior);
> >       if (error)
> >               return error;
> > @@ -1825,6 +1858,7 @@ int do_madvise(struct mm_struct *mm, unsigned lon=
g start, size_t len_in, int beh
> >       madvise_finish_tlb(&madv_behavior);
> >       madvise_unlock(mm, behavior);
> >
> > +out:
> >       return error;
> >  }
> >
> > --
> > 2.39.3 (Apple Git-146)
> >
>
> Cheers, Lorenzo