From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 2 May 2025 16:16:06 +0000
Subject: Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
To: Jann Horn
Cc: Andrii Nakryiko, akpm@linux-foundation.org, Liam.Howlett@oracle.com,
	lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz,
	peterx@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org,
	paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com,
	brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com,
	linux@weissschuh.net, willy@infradead.org, osalvador@suse.de,
	andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org
References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-9-surenb@google.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, May 2, 2025 at 3:11 PM Jann Horn wrote:
>
> On Fri, May 2, 2025 at 12:10 AM Andrii Nakryiko
> wrote:
> > On Tue, Apr 29, 2025 at
10:25 AM Jann Horn wrote:
> > > On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan wrote:
> > > > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn wrote:
> > > > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > > > wrote:
> > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan wrote:
> > > > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > > > address space modifications are detected and the lookup is retried.
> > > > > > > While we take the mmap_lock for reading during such contention, we
> > > > > > > do that momentarily only to record new mm_wr_seq counter.
> > > > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > > > speculation, IMO (because it's more obvious that vma's use is
> > > > > > localized to do_procmap_query(), instead of being spread across
> > > > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > > > then validate that VMA or mm_struct didn't change with
> > > > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > > > No need for vma_copy and any gets/puts, no?
> > > > > I really strongly dislike this "fully lockless" approach because it
> > > > > means we get data races all over the place, and it gets hard to reason
> > > > > about what happens especially if we do anything other than reading
> > > > > plain data from the VMA. When reading the implementation of
> > > > > do_procmap_query(), at basically every memory read you'd have to think
> > > > > twice as hard to figure out which fields can be concurrently updated
> > > > > elsewhere and whether the subsequent sequence count recheck can
> > > > > recover from the resulting badness.
> > > > >
> > > > > Just as one example, I think get_vma_name() could (depending on
> > > > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > > > lockless" approach creates more implicit assumptions about the
> > > > > behavior of core MM code, which could be broken by future changes to
> > > > > MM code.
> > > > Yeah, I'll need to re-evaluate such an approach after your review. I
> > > > like having get_stable_vma() to obtain a completely stable version of
> > > > the vma in a localized place and then stop worrying about possible
> > > > races. If implemented correctly, would that be enough to address your
> > > > concern, Jann?
> > > Yes, I think a stable local snapshot of the VMA (where tricky data
> > > races are limited to the VMA snapshotting code) is a good tradeoff.
> > I'm not sure I agree with VMA snapshot being better either, tbh. It is
> > error-prone to have a byte-by-byte local copy of VMA (which isn't
> > really that VMA anymore), and passing it into ops callbacks (which
> > expect "real" VMA)... Who guarantees that this won't backfire,
> > depending on vm_ops implementations? And constantly copying 176+ bytes
> > just to access a few fields out of it is a bit unfortunate...
>
> Yeah, we shouldn't be passing VMA snapshots into ops callbacks, I
> agree that we need to fall back to using proper locking for that.
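[For readers following along: the pointer-reread hazard Jann describes above can be reproduced outside the kernel. The sketch below is a hedged userspace illustration, not kernel code; `demo_vma`, `demo_vm_ops`, and `vdso_name` are invented stand-ins for `vm_area_struct`, `vm_operations_struct`, and a real ->name callback.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins; names are hypothetical, not kernel types. */
struct demo_vm_ops {
	const char *(*name)(void *vma);
};

struct demo_vma {
	const struct demo_vm_ops *vm_ops; /* may be swapped concurrently */
};

static const char *vdso_name(void *vma)
{
	(void)vma;
	return "[vdso]";
}

/*
 * Racy shape: vma->vm_ops is loaded more than once, so a concurrent
 * writer could install a different ops table (one with a NULL ->name)
 * between the check and the call:
 *
 *	if (vma->vm_ops && vma->vm_ops->name)	// loads #1 and #2
 *		return vma->vm_ops->name(vma);	// load #3 -> NULL deref
 *
 * Safer shape: snapshot the pointer once and use only the local copy;
 * in kernel code this single load would be ops = READ_ONCE(vma->vm_ops).
 */
static const char *demo_get_vma_name(struct demo_vma *vma)
{
	const struct demo_vm_ops *ops = vma->vm_ops; /* single load */

	if (ops && ops->name)
		return ops->name(vma);
	return NULL;
}
```

Note this only removes the reread: it does nothing for the lifetime of the object `ops` points to, which is why the thread still needs a snapshot or a lock for anything beyond plain data.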
Yes, I'm exploring the option of falling back to per-vma locking to
stabilize the VMA when its reference is used in vm_ops.

>
> > Also taking mmap_read_lock() sort of defeats the point of "RCU-only
> > access". It's still locking/unlocking and bouncing cache lines between
> > writer and reader frequently. How slow is per-VMA formatting?
>
> I think this mainly does two things?
>
> 1. It shifts the latency burden of concurrent access toward the reader
> a bit, essentially allowing writers to preempt this type of reader to
> some extent.
> 2. It avoids bouncing cache lines between this type of reader and
> other *readers*.
>
> > If we
> > take mmap_read_lock, format VMA information into a buffer under this
> > lock, and drop the mmap_read_lock, would it really be that much slower
> > compared to what Suren is doing in this patch set? And if no, that
> > would be so much simpler compared to this semi-locked/semi-RCU way
> > that is added in this patch set, no?

The problem this patch is trying to address is low-priority readers
blocking a high-priority writer. Taking mmap_read_lock for each VMA
will not help solve this problem. If we have to use locking, then
taking the per-vma lock would at least narrow the contention scope
from the entire address space down to individual VMAs.

>
> > But I do agree that vma->vm_ops->name access is hard to do in a
> > completely lockless way reliably. But also how frequently VMAs have
> > custom names/anon_vma_name?
>
> I think there are typically two VMAs with vm_ops->name per MM, vvar
> and vdso. (Since you also asked about anon_vma_name: I think
> anon_vma_name is more frequent than that on Android, there seem to be
> 58 of those VMAs even in a simple "cat" process.)
>
> > What if we detect that VMA has some
> > "fancy" functionality (like this custom name thing), and just fallback
> > to mmap_read_lock-protected logic, which needs to be supported as a
> > fallback even for lockless approach?
> >
> > This way we can process most (typical) VMAs completely locklessly,
> > while not adding any extra assumptions for all the potentially
> > complicated data pieces. WDYT?

The option I'm currently contemplating is using per-vma locks to deal
with "fancy" cases and to do snapshotting otherwise. We have several
options with different levels of complexity vs performance tradeoffs,
and finding the right balance will require some experimentation. I'll
likely need Paul's help soon to run his testcase with different
versions.

>
> And then we'd also use the fallback path if karg.build_id_size is set?
> And I guess we also need it if the VMA is hugetlb, because of
> vma_kernel_pagesize()? And use READ_ONCE() in places like
> vma_is_initial_heap()/vma_is_initial_stack()/arch_vma_name() for
> accessing both the VMA and the MM?
>
> And on top of that, we'd have to open-code/change anything that
> currently uses the ->vm_ops (such as vma_kernel_pagesize()), because
> between us checking the type of the VMA and later accessing ->vm_ops,
> the VMA object could have been reallocated with different ->vm_ops?
>
> I still don't like the idea of pushing the complexity of "the VMA
> contents are unstable, and every read from the VMA object may return
> data about a logically different VMA" down into these various helpers.
> In my mind, adding "the VMA contents can be an internally consistent
> snapshot" to the API contract for these helpers constrains the
> weirdness a bit more - though I guess the helpers will still need
> READ_ONCE() for accessing properties of the MM either way...
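[For readers following along: the speculate-then-validate shape discussed throughout the thread - rcu_read_lock(), mmap_lock_speculate_try_begin(), read, mmap_lock_speculate_retry(), and fall back to the lock on contention - is a sequence-counter pattern. Below is a hedged single-threaded userspace sketch; the `toy_*` names are invented stand-ins for the kernel helpers, with a plain C11 atomic modeling the mm write-sequence counter (mm_wr_seq in the patch), not the actual kernel implementation.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy sequence counter: incremented once when a writer starts (making
 * it odd) and once when it finishes (making it even again). */
struct toy_seq {
	atomic_uint count;
};

/* Begin a speculative read; fails immediately if a writer is active
 * (odd count), in which case the caller falls back to locking. */
static bool toy_speculate_try_begin(struct toy_seq *s, unsigned int *seq)
{
	*seq = atomic_load(&s->count);
	return (*seq & 1) == 0;
}

/* After the speculative read: true means a writer ran concurrently, so
 * the data read must be discarded and the lookup retried (or done
 * under the lock, as the patch does after repeated contention). */
static bool toy_speculate_retry(struct toy_seq *s, unsigned int seq)
{
	return atomic_load(&s->count) != seq;
}

static void toy_write_begin(struct toy_seq *s)
{
	atomic_fetch_add(&s->count, 1); /* count becomes odd */
}

static void toy_write_end(struct toy_seq *s)
{
	atomic_fetch_add(&s->count, 1); /* count becomes even */
}
```

The point of contention in the thread is what may safely happen between begin and retry: only plain loads of stable fields are race-free, which is why `->vm_ops` callbacks need either a snapshot or a real lock.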