From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7A3C1C5B542 for ; Tue, 27 May 2025 20:40:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D56EB6B0096; Tue, 27 May 2025 16:40:21 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D2EB36B0098; Tue, 27 May 2025 16:40:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C43D86B0099; Tue, 27 May 2025 16:40:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id A70596B0096 for ; Tue, 27 May 2025 16:40:21 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 043741616B6 for ; Tue, 27 May 2025 20:40:20 +0000 (UTC) X-FDA: 83489855442.02.2CADAE9 Received: from mail-ed1-f41.google.com (mail-ed1-f41.google.com [209.85.208.41]) by imf15.hostedemail.com (Postfix) with ESMTP id 02D5AA0016 for ; Tue, 27 May 2025 20:40:18 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=V6JIIkWA; spf=pass (imf15.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.208.41 as permitted sender) smtp.mailfrom=lokeshgidra@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748378419; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=91mdXzjYrGvaj7Dh9A9yl2QArCWVP/eA86xKiPghOcI=; b=iAmhd2OXHutbzU6jKVgarn++SlmZXNy5M39wiqRa/hFh989UXfzudCBfwlilsGejVBiEZg ImtIUbffdZPARS91WgPGKXvTsC4CiIEFPmPs7UI0J761G4Cdjz1VN3uFytb70PSTufmmmb Qg4eRZ5g8+R1M8urPiXUBndIlwFaLIk= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=V6JIIkWA; spf=pass (imf15.hostedemail.com: domain of lokeshgidra@google.com designates 209.85.208.41 as permitted sender) smtp.mailfrom=lokeshgidra@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748378419; a=rsa-sha256; cv=none; b=5mQ6KAfTJCbr/nWCXxjkVYmb1ghVcm2bp75Xx0QTsz7wL7ukFeBgDw5sJ874xdsFj7K5Hd uCZcSevyIk2HDoYXqAZWJSTMRJ/5twKqYbfA0YEEsBUt5k1T4IsiJbRVH8xb8aGB2UpVCf MVUqjW1ubklH0y6ju/VcdCPEqAiRPE0= Received: by mail-ed1-f41.google.com with SMTP id 4fb4d7f45d1cf-6024087086dso3169a12.0 for ; Tue, 27 May 2025 13:40:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1748378417; x=1748983217; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=91mdXzjYrGvaj7Dh9A9yl2QArCWVP/eA86xKiPghOcI=; b=V6JIIkWAUMJYDcOoEQG1G68WPXmlVsztnmspZ3lPnK40ygATbim3Dyfx3CxHSn8LAP sWdhXs+CSj2DuSldtgWBTAqAqhRC728xf/W03WnnE+CzSA3H1YDRuzJtGr3O6zys0rjj 0iWanYBF5b8oRwbq9yEasJ1Dk4uiEXVSuy5wPZQ8R87lUUFQ+SEtRPUGka8cM20oBuBk ZR7Lt48WAJZ25dSeAX+pTPKPXOQQw6Qtbsne+Ma2lfO9NHRKNusYBZ/xXELvpw6esi7g zcd8eU+/xBMF1VMixEU8BPwbZiIkNwYYKljsCbVBKR1GMMT6iE/z2xev0EGPp+mQ2xPQ VWeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748378417; x=1748983217; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=91mdXzjYrGvaj7Dh9A9yl2QArCWVP/eA86xKiPghOcI=; b=FgM3kzs7GswRfzZKEVIExHjZBdBqDWLnD1T4vFkWN4q+vvCNoCK6QudwKcf3cDITc2 IoRKJ0KM6saYJONuwChat8zokFjtTg3CNIsZVJBLcAqiIzBfgI/ikaodPWQznOI4IvPR VvZkvR9GEwqNR4bzD/2wIdhXZhoCLz+DzaDBTs6l1XNC1eGzhd4wfzl0IYWmrVuhX57x Qx6rI+o9zR1Wf+MtkJqQ0f5DaqNs5ZAYTAwwKDycOS8riCwFl2B6aMONdmFgs+nDkej9 DCHBKqiEBZFxYkoQeGt5yT7xn6MNU1SAVu6uA8yzBrINuKBKdv5imkjazhrp6ki/JJbh GKZg== X-Forwarded-Encrypted: i=1; AJvYcCW/VmPZFYw7fddBNIXIoXggogp8/R8/EN6EtbpTKTRa6liJsd/aQKgbK2Y+aDr/BUZJRZzOhz68UQ==@kvack.org X-Gm-Message-State: AOJu0Yw9f9fBtEYcafoHhn21CwDhajU0raunYBl3+ycdZ+l5DGWkufIX KyglQEePNHoo1Il0lEfc0+Oeaz8+/psnQTfXR2/eKoCBuqXWUTja+3L9nZiweqt1cLYh7n9Aogl R+oSYq+CKq2z/LDE+wqpNsGnAtg0SMFDNt5g4iGEk X-Gm-Gg: ASbGncv7TMCcuN1y/6EwJJh0ZeUSE0iilaAEhEGcYxyQSM8h0O2DqpSX/19HcWjEkXr dE7adygq6+4KQVjpbPggfXCUYpzG7EZj/CUdtsSPjppOhaJwf3rLKI6qbp3qTs4Rz2eTdDaht6v dmX6fjaJpalpeSbnGA9Bv0ehmecef02NuidLzq6dmcsO+Tk2iHB8owgtzLLvWD/kh8pUQo6mbK X-Google-Smtp-Source: AGHT+IEyQdVjEh01ByN1Gy4wyq+UouMQztZNI3UXZcPd8rFwNBK7jeF6QdKgHLHKDSNEfhZEV1rUngjqYE7RKwuOyXI= X-Received: by 2002:aa7:cfd4:0:b0:600:9008:4a40 with SMTP id 4fb4d7f45d1cf-60515940cd3mr27335a12.4.1748378417078; Tue, 27 May 2025 13:40:17 -0700 (PDT) MIME-Version: 1.0 References: <20250527044145.13153-1-21cnbao@gmail.com> <93385672-927f-4de5-a158-fc3fc0424be0@lucifer.local> In-Reply-To: <93385672-927f-4de5-a158-fc3fc0424be0@lucifer.local> From: Lokesh Gidra Date: Tue, 27 May 2025 13:40:05 -0700 X-Gm-Features: AX0GCFvsqylKcHHDepKynOdNf6b-SwzsoF1wjU3y-l44a5LLCSguH4yWv1a1Tvw Message-ID: Subject: Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED To: Lorenzo Stoakes Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song , "Liam R. Howlett" , David Hildenbrand , Vlastimil Babka , Jann Horn , Suren Baghdasaryan , Tangquan Zheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: gr94dmpt6oqi8t6nmrs7gd8jk43cyqw4 X-Rspamd-Queue-Id: 02D5AA0016 X-Rspam-User: X-Rspamd-Server: rspam02 X-HE-Tag: 1748378418-291754 X-HE-Meta: U2FsdGVkX1+qUkeU/pzijcexthy0gXskYibwH/BSYTt1JVYOowDkQSIoDMYg+zmQoakMjY1hqSQFWo1hFfFLTtylVgAyU4eXHI4xRBvWFVzBJywT/SWLSqlGS0Ro4n7PILTLm1S+FWUgeSwiM+mIhUzH17t3Rwdoj4sG/0CN7hW/GwK4Jiusuyt8GQ6A1LjiS0LlaFx7KGbq56NHpUXlmS7B+dgoDFCpdcgGTzM/RPy3ZhXjpdbbVN/wMUgHCbHl93AhiF0hNxoqERwQIoWwz8KlkdvncwIhWNYJGa8Q9NGNHL0OHQBSnQQnBVXzJrkdOogEdeXtMbFplGLsNEJDkyZmJ0OG5fU6hY5FGQRuyzLHebN3OC6MAbFT4eu00mQbkF+QI+AFH9o2n9hJd0/nUf0OaCjkt0jHWshtoVlbrvUMyz4jHTL589/7+cKceYjEjQ8GE7XaZIfk7+/PVJIgdCZ8VunRb7eug8n3kkIUjtCL0L0ZXYKM6TN8Dx4g+PJwhvQFN8Rbzy4g6moQmn1x8Jcmvw9U+udjZw3SqPntU25obuttuhPZAhuAqPZfpAeizrXD8BzKi4MFQCB+MJ8mNGjp/fbu5Y63SwRpA7wqBBvc/3UvlwvQVw2HjykLBTBG8wJjZtuofMzONqaC/VVMdTtsxqXYIkCuYOazHqrPeh/xZ9x3UP4zpycwzmRa9PpFb4m1yAqa1CsppW5kfHSgdt4g3Ea6tttvKsyiu6x6PAX04LZ0DN6OVzT7ImP8ziuXzfA4vdbbWCYUd7ijLzfGl1hJDWMaGmWXXOAVrAV1OSABVLSb5ODEfgIyzf9OJpoW4Id/6AfWglKZ5PjoAcJgLHcTAJR/hgxE6FmBg/NO3H7fYwX9bS/CmIk2N/9mmKhsTwX8fw5x4XseLqeqTOuHCSNP1fUC+h0qQNHv7eh33QZNMOh8ztRpX6Q9ZS4T14s8898gJ9pAnHo= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Thanks a lot, Barry. This was overdue. On Tue, May 27, 2025 at 2:20=E2=80=AFAM Lorenzo Stoakes wrote: > > Overall - thanks for this, and I'm not sure why we didn't think of doing > this sooner :P this seems like a super valid thing to try to use the vma > lock with. > > I see you've cc'd Suren who has the most expertise in this and can > hopefully audit this and ensure all is good, but from the process address > doc (see below), I think we're good to just have the VMA stabilised for a > zap. > > On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote: > > From: Barry Song > > > > Certain madvise operations, especially MADV_DONTNEED, occur far more > > frequently than other madvise options, particularly in native and Java > > heaps for dynamic memory management. Actually, it would be great if you can also optimize the MADV_FREE case. There are quite a few performance critical situations where MADV_FREE is performed and are done within a single VMA. > > Ack yeah, I have gathered that this is the case previously. > > > > > Currently, the mmap_lock is always held during these operations, even w= hen > > unnecessary. This causes lock contention and can lead to severe priorit= y > > inversion, where low-priority threads=E2=80=94such as Android's HeapTas= kDaemon=E2=80=94 > > hold the lock and block higher-priority threads. > > That's very nasty... we definitely want to eliminate as much mmap_lock > contention as possible. > > > > > This patch enables the use of per-VMA locks when the advised range lies > > entirely within a single VMA, avoiding the need for full VMA traversal.= In > > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VM= As. > > Yeah this single VMA requirement is obviously absolutely key. > > As per my docs [0] actually, for zapping a single VMA, 'The VMA need only= be > kept stable for this operation.' (I had to look this up to remind myself = :P) > > [0]: https://kernel.org/doc/html/latest/mm/process_addrs.html > > So we actually... should be good here, locking-wise. > > > > > Tangquan=E2=80=99s testing shows that over 99.5% of memory reclaimed by= Android > > benefits from this per-VMA lock optimization. After extended runtime, > > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while > > only 1,231 fell back to mmap_lock. I am quite confident that this pattern is not limited to HeapTaskDaemon. For instance, I'm sure the Scudo allocator must also be calling MADV_DONTNEED within single VMAs. > > Thanks, this sounds really promising! > > I take it then you have as a result, heavily tested this change? > > > > > To simplify handling, the implementation falls back to the standard > > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity= of > > userfaultfd_remove(). > > Oh GOD do I hate how we implement uffd. Have I ever mentioned that? Well, > let me mention it again... > > > > > Cc: "Liam R. Howlett" > > Cc: Lorenzo Stoakes > > Cc: David Hildenbrand > > Cc: Vlastimil Babka > > Cc: Jann Horn > > Cc: Suren Baghdasaryan > > Cc: Lokesh Gidra > > Cc: Tangquan Zheng > > Signed-off-by: Barry Song > > --- > > mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 34 insertions(+) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 8433ac9b27e0..da016a1d0434 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned lo= ng start, size_t len_in, int beh > > > > if (madvise_should_skip(start, len_in, behavior, &error)) > > return error; > > + > > + /* > > + * MADV_DONTNEED is commonly used with userspace heaps and most o= ften > > + * affects a single VMA. In these cases, we can use per-VMA locks= to > > + * reduce contention on the mmap_lock. > > + */ > > + if (behavior =3D=3D MADV_DONTNEED || behavior =3D=3D MADV_DONTNEE= D_LOCKED) { > > So firstly doing this here means process_madvise() doesn't get this benef= it, and > we're inconsistent between the two which we really want to avoid. > > But secondly - we definitely need to find a better way to do this :) this > basically follows the 'ignore the existing approach and throw in an if > (special case) { ... }' pattern that I feel we really need to do all we c= an > to avoid in the kernel. > > This lies the way of uffd, hugetlb, and thus horrors beyond imagining. > > I can see why you did this as this is kind of special-cased a bit, and we > already do this kind of thing all over the place but let's try to avoid > this here. > > So I suggest: > > - Remove any code for this from do_madvise() and thus make it available t= o > process_madvise() also. > > - Try to avoid the special casing here as much as humanly possible :) > > - Update madvise_lock()/unlock() to get passed a pointer to struct > madvise_behavior to which we can add a boolean or even better I think - > an enum indicating which lock type was taken (this can simplify > madvise_unlock() also). > > - Update madvise_lock() to do all of the checks below, we already > effectively do a switch (behavior) so it's not so crazy to do this. And > you can also do the fallthrough logic there. > > - Obviously madvise_unlock() can be updated to do vma_end_read(). > > > + struct vm_area_struct *prev, *vma; > > + unsigned long untagged_start, end; > > + > > + untagged_start =3D untagged_addr(start); > > + end =3D untagged_start + len_in; > > + vma =3D lock_vma_under_rcu(mm, untagged_start); > > + if (!vma) > > + goto lock; > > + if (end > vma->vm_end || userfaultfd_armed(vma)) { > > + vma_end_read(vma); > > + goto lock; > > + } > > + if (unlikely(!can_modify_vma_madv(vma, behavior))) { > > + error =3D -EPERM; > > + vma_end_read(vma); > > + goto out; > > + } > > + madvise_init_tlb(&madv_behavior, mm); > > + error =3D madvise_dontneed_free(vma, &prev, untagged_star= t, > > + end, &madv_behavior); > > + madvise_finish_tlb(&madv_behavior); > > + vma_end_read(vma); > > + goto out; > > + } > > + > > +lock: > > error =3D madvise_lock(mm, behavior); > > if (error) > > return error; > > @@ -1825,6 +1858,7 @@ int do_madvise(struct mm_struct *mm, unsigned lon= g start, size_t len_in, int beh > > madvise_finish_tlb(&madv_behavior); > > madvise_unlock(mm, behavior); > > > > +out: > > return error; > > } > > > > -- > > 2.39.3 (Apple Git-146) > > > > Cheers, Lorenzo