From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1950DC369CB for ; Wed, 23 Apr 2025 22:06:25 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5C3046B0007; Wed, 23 Apr 2025 18:06:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 572896B0008; Wed, 23 Apr 2025 18:06:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3EB546B000A; Wed, 23 Apr 2025 18:06:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 1D1546B0007 for ; Wed, 23 Apr 2025 18:06:23 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id F22DC1A0596 for ; Wed, 23 Apr 2025 22:06:23 +0000 (UTC) X-FDA: 83366693046.06.CA2D72D Received: from mail-pj1-f43.google.com (mail-pj1-f43.google.com [209.85.216.43]) by imf11.hostedemail.com (Postfix) with ESMTP id 2054F40004 for ; Wed, 23 Apr 2025 22:06:21 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Fv+3yZrG; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf11.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.216.43 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1745445982; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=AsiPVmQ9vMCRkpTsBzVfBMDI0AoYgjzP6nuAuKJ1bTo=; b=nj7bw19QGxQg87hwhHCnYmQTBP3H5PRbrYKlaX21Z5xE5pvVDhSIcklcWtblFAZ3g//tS6 S3hJMBjyGbNsw7BHfDi5HXafDv/72kwLIIqRQCu2YkP1IY5i+d6hoF5dJfV6TqTul65GyC AB2RZlag16AZL27Q5/jxV1/Ba2v+Yog= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745445982; a=rsa-sha256; cv=none; b=L5yOkbHTggfZ1Yzb9PpD3iFTWGeWjOQVFkkFJNqJd8gR7ajlYT0DD7QqC/1nZDS6UfmRPA F/JRbu69dc9mW/Jo71ZaAeKYcUUQE4gKK+rVdnHSf8VrzEj8U62VSBOrsJDgDO7O9IwSPO SIGX1wqqS1hYG+AQVKyxAmzlgAepK0A= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Fv+3yZrG; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf11.hostedemail.com: domain of andrii.nakryiko@gmail.com designates 209.85.216.43 as permitted sender) smtp.mailfrom=andrii.nakryiko@gmail.com Received: by mail-pj1-f43.google.com with SMTP id 98e67ed59e1d1-30572effb26so352892a91.0 for ; Wed, 23 Apr 2025 15:06:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745445981; x=1746050781; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=AsiPVmQ9vMCRkpTsBzVfBMDI0AoYgjzP6nuAuKJ1bTo=; b=Fv+3yZrGQ3bcHszIVBahuj3pPpYx0J4gUvxsEuQWOhtbSNB16tsepdtlIl6ydzWKb7 ngU1zH1Um9mkP0LpQtRBteCLDVB4HNOqViDEPXVWbRvbvGljLmjjbtGX1lRbT2Ea2QvT Zvx9r9T3HVu2CFaaeWWmCT8brKwmDNHOfHc39fzDzYB+P4lc/tQuMjDkHopNdJ+4+Gcc 6G9FvHFlkwRJUSEv41ZYobM6cfRtnAQrSSbgDq9q+F8rGtK6r6Fs0BZix97ZNWszkiE2 fHfETknXWQ/gW9Xx7MhZaum+WeepKh+hve+ZWwIup9eRCSNNCe5JiwwL2Fjt3syqVRqw kXnw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745445981; x=1746050781; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=AsiPVmQ9vMCRkpTsBzVfBMDI0AoYgjzP6nuAuKJ1bTo=; b=q9J85CJHmkpkFgtg594mhCZqhAcynYkBvY3eDLLurPVjks9XSOYduGMuqdAmDe5DYN CmW0IztFlaiu1BgC0EfsKvhkk4oW/i4pkHgZ4gzs3j5uSKKuA0qS6jcTt/AG+RXpvHy2 hME857psKzfAU9a0Ck4nf+P3An7HJNlNvWkQqLi2h1L9rSy4lkSEHJhWXqaTyyNkoBpb TJTa97Bsp3UPycG21Ghcdc7lAGpT7zkzRwMiozy6Ze3wZMzlxsz60JL89p7Bj+dB5zZf JL0qNwzyUxAM1EfAtlcrXg5Y5PZMRPQcOpil6kRTZ8utr4vsvDqSaIBbyBwhXKnrqh1d iksA== X-Forwarded-Encrypted: i=1; AJvYcCU5cG7pe5EvEm3QKY0DAQGOzw9yvdP2l5dM4yi+P5pqtb5UAq6KRj+FYGYfBRWYN42YQhg6a02UEg==@kvack.org X-Gm-Message-State: AOJu0YwVs1fyME2YdRdl3uc95Km1bxpTU4BqTLxanbAq9clnjWBjYuQ0 cU7aKS9pVnkXxIETSgWNb3zXAtRhqVzmUyuzvKeCSBLkS32F5UUYtyrotWn4H9zEkj2dNgT9tB3 Eff/N/EAO1sF0mLr9ZIjDqyHxxNM= X-Gm-Gg: ASbGncs/qdaIp8CdA/fESa51VyuUlh5cqugYpEsJPPDYELzykeTkabW8lNpupA4DkvA Ir8F7QHwLIgA2+gFxiMthbadt2auPfW4c6ICRC75X7ZhzLnwgcdO6wm98y407nYUEY2v0mjOn0P 154L2XNQJXA3TCS94U+1xXxXFwEM9ISLKbPD1lNg== X-Google-Smtp-Source: AGHT+IHPtTemguCK+hx6hcId37B8xL62Uyc2LDddSUMllW5thzGIQ0M8WwtIdLrFj5yuefDf8uif4NNLqSOOAgS6Zxc= X-Received: by 2002:a17:90b:6c3:b0:2ee:6d08:7936 with SMTP id 98e67ed59e1d1-309ed2a3f21mr580521a91.20.1745445980960; Wed, 23 Apr 2025 15:06:20 -0700 (PDT) MIME-Version: 1.0 References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com> In-Reply-To: From: Andrii Nakryiko Date: Wed, 23 Apr 2025 15:06:08 -0700 X-Gm-Features: ATxdqUEhpIukcADUQT8bpUPiETvpQWy_LLT6W5s_rIldofkQMQI-nBweMCKUcrM Message-ID: Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU To: Suren Baghdasaryan Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 2054F40004 X-Rspam-User: X-Stat-Signature: fy47my8w3pjfbi34t4gd3nbfuukh461u X-HE-Tag: 1745445981-481367 X-HE-Meta: U2FsdGVkX1/g0c6GlQQyUt3hS53xgt3/xkHr/SIC7SOzPX9Y9Bda6Xkvonigaqug1QSIWhZf1yvI7ZDIhnOzsrcvDVNNujFEmo7GLpMliamESxy0iTdSZCcpsebD01LJqNs7D5X63G84BPDiQb6x1VVw6jQqnjsAvv/rjpurLKxWTSzSo/3Nk2ouF202XbA53/ZxOvZtClja8ZEVOmJLQdNxqO2MycyxC/URxURqk7U5StLn55UgYxYl/cEtF4XQGq8IWqUd72mCa/Ib8EbuJ9WvAXDgzfRjiWojw+MNs2Dqi42qN+Pvxz3hqrYQhqPnFS2CkQSVuukH6mIq/TSA+l7vs7X739KYObaTUciFDe5Bk81ZzaUwQkiS/joUDGjFSPh1Ri+zj/zcJiR+7YAhBFSmerJY94+5nRSgI9M0142fvng+gggHNEt4X/YEa9AvWWgOdCO+eOy8Yn8byjVBBrd894nsVguBbPpXuGxxTs9qvE/tqctXHzfdRiXTvxHdIGq817Aw+YqCeBx01yohrj9VYReVbx5oaR3cn4LWHRq/TTQN/ZBPaQTcoXoG3PRJRnQqUlTPh+ZN6n3FWm1wTyaf4F3p65sRJKPIVSQCgvC3cTokLOsuObJ75/5d5yZ6saEDI2faYX9fdI3CqlJ7i/LaLs55HcIxkUhslHu/hvJkG4JEt70J2zxN52ShSBlOCBVtkr7Wz2r1gNLnEP9I9ULAI9aLigStAQ8Tu10n1yYdcW9CIQQzC8Z/VNVsbiwNFGpr687rOV/jd04ph3tcRbjP8zcTlm0LtquTHz2R6A8ivqWzJHh1nYt8L6wVVhKRlwwtzGVMhVNz+b1YE43TEaZxY+0in3MuV88bVjFkwbduTp1SOwVmIBS8CKLgPk170pS15Fn+vIeXPnsRnngFeflGRFd0OX6Jf9v5nRoZkzxIjm5pFQTpibE+vsht/JuFWAWe1aiGMPu3f4Jy4zr hH8U0vP8 kkjhtRsqUVxdB24Vh6Z9cdBragAErDPtADcwoVlnjJWf2Ne6ufufV0XCdkUds4preJzxGMCjGruthZlN535DKEXsB+UwS43LD1m18UpOyTO473V94v9506EM6GP8nFCxfio6SVLSXsGwsrNbb+ZgAPzo2eD6+3EJE1JaYO5cd3BZ5hPFrYFc8sP1NUr/rCXr1xbOk8N2Dj5Ki+C7qVexg+8iuCrFUC3Bcikeie4+5CPYa91R3TS9V33aRWtiUd1E/JwZokCqsyflAdHXW+c1q6jbeBbXqG+Nmv14jrlxdNXGmZNamsfmsaB/80nyLPnzcKVUC0OtS51Jp6gijHwRSau40ytSbAWuuQsifQuNi3xIc9HOdwJ18O31mh6PaqHmevSoh22fTw8iN7SZco+GOOq6RV+tXiotAJD9+bQm/0Uyhrkles3O4GXbdbg6SxsPfIp3Ndxl2CE+uWYg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Apr 23, 2025 at 2:49=E2=80=AFPM Suren Baghdasaryan wrote: > > On Tue, Apr 22, 2025 at 3:49=E2=80=AFPM Andrii Nakryiko > wrote: > > > > On Fri, Apr 18, 2025 at 10:50=E2=80=AFAM Suren Baghdasaryan wrote: > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > its important members being RCU-safe, /proc/pid/maps can be read unde= r > > > RCU and without the need to read-lock mmap_lock. However vma content > > > can change from under us, therefore we make a copy of the vma and we > > > pin pointer fields used when generating the output (currently only > > > vm_file and anon_name). Afterwards we check for concurrent address > > > space modifications, wait for them to end and retry. While we take > > > the mmap_lock for reading during such contention, we do that momentar= ily > > > only to record new mm_wr_seq counter. This change is designed to redu= ce > > > > This is probably a stupid question, but why do we need to take a lock > > just to record this counter? uprobes get away without taking mmap_lock > > even for reads, and still record this seq counter. And then detect > > whether there were any modifications in between. Why does this change > > need more heavy-weight mmap_read_lock to do speculative reads? > > Not a stupid question. mmap_read_lock() is used to wait for the writer > to finish what it's doing and then we continue by recording a new > sequence counter value and call mmap_read_unlock. This is what > get_vma_snapshot() does. But your question made me realize that we can > optimize m_start() further by not taking mmap_read_lock at all. > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > try mmap_lock_speculate_try_begin() and only if it fails do the same > dance we do in the get_vma_snapshot(). I think that should work. Ok, yeah, it would be great to avoid taking a lock in a common case! > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps fil= es > > > (often a low priority task, such as monitoring/data collection servic= es) > > > from blocking address space updates. > > > Note that this change has a userspace visible disadvantage: it allows > > > for sub-page data tearing as opposed to the previous mechanism where > > > data tearing could happen only between pages of generated output data= . > > > Since current userspace considers data tearing between pages to be > > > acceptable, we assume is will be able to handle sub-page data tearing > > > as well. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > fs/proc/internal.h | 6 ++ > > > fs/proc/task_mmu.c | 170 ++++++++++++++++++++++++++++++++++--= -- > > > include/linux/mm_inline.h | 18 ++++ > > > 3 files changed, 177 insertions(+), 17 deletions(-) > > > > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > > > index 96122e91c645..6e1169c1f4df 100644 > > > --- a/fs/proc/internal.h > > > +++ b/fs/proc/internal.h > > > @@ -379,6 +379,12 @@ struct proc_maps_private { > > > struct task_struct *task; > > > struct mm_struct *mm; > > > struct vma_iterator iter; > > > + bool mmap_locked; > > > + loff_t last_pos; > > > +#ifdef CONFIG_PER_VMA_LOCK > > > + unsigned int mm_wr_seq; > > > + struct vm_area_struct vma_copy; > > > +#endif > > > #ifdef CONFIG_NUMA > > > struct mempolicy *task_mempolicy; > > > #endif > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc= _maps_private *priv) > > > } > > > #endif > > > > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private = *priv, > > > - loff_t *ppos) > > > +#ifdef CONFIG_PER_VMA_LOCK > > > + > > > +static const struct seq_operations proc_pid_maps_op; > > > + > > > +/* > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used = by > > > + * show_map_vma. > > > + */ > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct v= m_area_struct *vma) > > > +{ > > > + struct vm_area_struct *copy =3D &priv->vma_copy; > > > + int ret =3D -EAGAIN; > > > + > > > + memcpy(copy, vma, sizeof(*vma)); > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > + goto out; > > > + > > > + if (!anon_vma_name_get_if_valid(copy)) > > > + goto put_file; > > > > Given vm_file and anon_vma_name are both RCU-protected, if we take > > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't > > even need getting/putting them, no? > > Yeah, anon_vma_name indeed looks safe without pinning but vm_file is > using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer > but pointing to a wrong object even if the rcu grace period did not > pass. With my assumption that seq_file can't rollback once show_map() > is done, I would need a completely stable vma at the time show_map() > is executed so that it does not change from under us while we are > generating the output. > OTOH, if we indeed can rollback while generating seq_file output then > show_map() could output potentially invalid vma, then check for vma > changes and when detected, rollback seq_file and retry again. > > > > > I feel like I'm missing some important limitations, but I don't think > > they are spelled out explicitly... > > > > > + > > > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) > > > + return 0; > > > + > > > + /* Address space got modified, vma might be stale. Re-lock an= d retry. */ > > > + rcu_read_unlock(); > > > > Another question I have is whether we really need to copy vma into > > priv->vma_copy to have a stable snapshot? Can't we just speculate like > > with uprobes under assumption that data doesn't change. And once we > > are done producing text output, confirm that speculation was > > successful, and if not - retry? > > > > We'd need an API for seq_file to rollback to previous good known > > location for that, but that seems straightforward enough to do, no? > > Just remember buffer position before speculation, write data, check > > for no mm modifications, and if something changed, rollback seq file > > to last position. > > From looking at seq_read_iter() I think for a rollback we would have > to reset seq_file.count and seq_file.index to their previous states. > At this location: > https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if > show_map() returns negative value m->count will indeed be rolled back > but not seq_file.index. Also returning negative value at > https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230 > would be interpreted as a hard error... So, I'll need to spend some > time in this code to get the answer about rollback. > Thanks for the review! Yeah, seq_file is a glorified wrapper around a memory buffer, essentially. And at the lowest level, this transaction-like API would basically just return seq->count before we start writing anything, and rollback will just accept a new count to set to seq->count, if we need to rollback. Logistically this all needs to be factored into the whole seq_file callbacks thing, of course, especially if "transaction" will be started in m_start/m_next, while it can be "aborted" in m_show... So that's what would need careful consideration. But you can end up with faster and cleaner implementation, as we discussed above, so worth giving it a try, IMO. > > > > > > > > + ret =3D mmap_read_lock_killable(priv->mm); > > > + if (!ret) { > > > + /* mmap_lock_speculate_try_begin() succeeds when hold= ing mmap_read_lock */ > > > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_= seq); > > > + mmap_read_unlock(priv->mm); > > > + ret =3D -EAGAIN; > > > + } > > > + > > > + rcu_read_lock(); > > > + > > > + anon_vma_name_put_if_valid(copy); > > > +put_file: > > > + if (copy->vm_file) > > > + fput(copy->vm_file); > > > +out: > > > + return ret; > > > +} > > > + > > > +static void put_vma_snapshot(struct proc_maps_private *priv) > > > +{ > > > + struct vm_area_struct *vma =3D &priv->vma_copy; > > > + > > > + anon_vma_name_put_if_valid(vma); > > > + if (vma->vm_file) > > > + fput(vma->vm_file); > > > +} > > > + > > > > [...]