From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 02342C369D5 for ; Wed, 23 Apr 2025 21:49:48 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C59476B0006; Wed, 23 Apr 2025 17:49:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BE1F16B0007; Wed, 23 Apr 2025 17:49:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A0B5A6B0008; Wed, 23 Apr 2025 17:49:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 7B91B6B0006 for ; Wed, 23 Apr 2025 17:49:46 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 8A57E140493 for ; Wed, 23 Apr 2025 21:49:47 +0000 (UTC) X-FDA: 83366651214.17.8812592 Received: from mail-qt1-f177.google.com (mail-qt1-f177.google.com [209.85.160.177]) by imf11.hostedemail.com (Postfix) with ESMTP id A24C240002 for ; Wed, 23 Apr 2025 21:49:45 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=hUXiFsAI; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf11.hostedemail.com: domain of surenb@google.com designates 209.85.160.177 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1745444985; a=rsa-sha256; cv=none; b=Zz8ZgqxO7wGwESrOw8TsTFBT5QLGg4lHHZ7FSEiCpJArMRf2+9JNVBTnDweL9vFDqAhwrp EyiUUezNeukv+T+XvfohZcPJ/zHdWPk6w4cW48kmGRdSsIEiLmNikipE1sRi7fFTyj7zzI ZGaoePP7VFEaYyMf/2fYKPHE3qeslB8= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=hUXiFsAI; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf11.hostedemail.com: domain of surenb@google.com designates 209.85.160.177 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1745444985; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=S7/wD0XYriNGKp673eGGJ1mXgCpZhTWu5sE1MgjgFPg=; b=S6FyP3JYCs/5uLopLzooj7YX03ZTXf7HJjHLlS6GVQ91oUVH/XNMRXmUoHcYphiRe+7+wm qewty/3HzFsWXlvFFY9JB40qjgwxXBauriFMIXa40AFt0ZquWxWX8sKCVXP6ERMRb01EeV qboARMHXJ5c4I7pgupORgIX3m8OgSe0= Received: by mail-qt1-f177.google.com with SMTP id d75a77b69052e-47666573242so140461cf.0 for ; Wed, 23 Apr 2025 14:49:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1745444985; x=1746049785; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=S7/wD0XYriNGKp673eGGJ1mXgCpZhTWu5sE1MgjgFPg=; b=hUXiFsAIP+XQWSXRRCvdZSZk6VAwhHsrxT0WKZSEz/ewQ4hbf0jp5NdDqG81y3mR0i fzvUGgIYa6o+gGZ4I4Y9p4QX+ytErqXmufN5GbTOFtPBvdhNrRlKDRGAnJVVTmgMr8RO JtVfFzApqKVNYt75HOaM3P/jFOSXCMtQs+oCsUKpJkHpjFMdljfAbV4laRauhAxckZkv DBLiLk1gFRv8541hxbm70Dsw3NzpicmowgGPhonG6h81z/ZHuF3nT5Hn/GW+natNBP8i 2ugg6j7y0953oSn7Az5PYzrmU2YZ61ucwXnpAvjCqfizONcO3T6VDW1B6mCZOFsF14bf TFGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745444985; x=1746049785; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=S7/wD0XYriNGKp673eGGJ1mXgCpZhTWu5sE1MgjgFPg=; b=FRHPbyaaqXqNnNHPtDM8gJ+NHuQk7oLJTcp8zGnN5383ISk77ADDyd/MbtMdHKCCod hjvhh+3fhC7sdZ5mil/LxNj/dUssM2BdTJ0c879gF7+xXqdgz4hjCXydJz3THabY3Hs9 QG4xuyckP+/q2HjrazdoitUKXzCfw5toQvoYfhVWzTF2te+2E1deVPWapsKKBFb8isD5 E2FTJKBRjUu2ralYfhbQCtYmM93ShLjdD2CVLBaKNAMuLrRZHP8sGiCaq6dBAW8vgEtz 1ewCzSVgxFBOpuUG3ZEIQjtzuChi4XfgNzucVlul/wFYRTg025n4ij24WUKiJ8GRHqjw tppQ== X-Forwarded-Encrypted: i=1; AJvYcCUMZ32oIniCOEWZE2ZfXkkcCWDlXaW5g3DII66TMcw+Un+dLo5o1Um4q5HuTOFU3E+FAv0mUwm5fw==@kvack.org X-Gm-Message-State: AOJu0Yw2qZw7IWedhwvOJsHBHACAZQyT5XhULpg3wo8n6Q0j37xBrXE5 WsidgsjB+0XpV5soVInP2Qcb9sx/e1UjwiduQESdetmvkxHpDmRhhRztiVLbxXK55SXY68Gnh9/ hFn8Ch+B8b4UdyA2CLNvDp1u15Wk3zp9UlPYn X-Gm-Gg: ASbGncud9KBLmPGsZ/n+3RDXXEdS4TwMwbl3DPA+KtFJ8CEhh7LBV42oFjc1YK/VUNo 3hMjHsvGXu7ItmfaWiRel7agVdJnGCYNMulgjGYoX2zEpHVKydeiV4SF3La1ILl22W5fO470NNS tnTP2vDlL//TEW6qB6n2hYAbosKGRUrVsq0U523AHWQGxGuOzb3irb X-Google-Smtp-Source: AGHT+IFQfVsUfGLvaOpqsQMsbqmuN8itGhxrZwwZ0MFIqvRyvcpvtZya/7FsimV2oONrYFkrBH9PEKrdK8GJwxZvGa0= X-Received: by 2002:a05:622a:4105:b0:47a:ed97:373e with SMTP id d75a77b69052e-47e79facab0mr1370541cf.10.1745444984405; Wed, 23 Apr 2025 14:49:44 -0700 (PDT) MIME-Version: 1.0 References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com> In-Reply-To: From: Suren Baghdasaryan Date: Wed, 23 Apr 2025 14:49:33 -0700 X-Gm-Features: ATxdqUHhpsoXV5fOytdwpz53cjD4EaJekweJa9XT6PbfPehhPHjBovEw6rKpEEE Message-ID: Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU To: Andrii Nakryiko Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: A24C240002 X-Stat-Signature: 7d4nc8xcfrirjerd9ye9i5k7hytyuyup X-Rspam-User: X-HE-Tag: 1745444985-288880 X-HE-Meta: U2FsdGVkX1/SGg3j5mnG/sPiSAMtFBa5esegMHi1hxOikavyEOOtnErrtRMRM/2Wfi9RpsyjHkUdNzF8iwO7PITfa1ntKZ71z1dDBa9O5i4WxTGjYbxEZ83dPD3qBHXjP/qE+OsUF4R4Pt8+FMY1aSl5JmMRDOF8v1NxOUGDBwGHN/2TadzbNK2xi6WkPp3tSxkhHIuULvAw685tNAmRqL9WbM5EayeD3NQ6nKCfDxtT8PnxSpg4Q+0cBSxVYyQ2buB9RorITgS2AOoWELphG44Cp7Xu6fIRVorSt01QFxcgbEdbRIYJsOtyjwMRzBMHut6rECCw2mU4LcF674B2bEP3LMINaQApXq3UtIpWwoLOa3rZEx4I0EKR2fY7Q4Yw4Z4fJzmFXpFJ6t3YSWkWs83POoQ/jyexgpGbaHeJbQSyrgW0NPNnshqgaKr9/7hIY7edDrGck9/NrXv8Il4vBzB5ajcKG7IJfpwpZaklw2DYaYdfg8POI4ug5g2B0P55CMjO76cqIG083LCU9IZ73l46kBql1n5Kd1nRqhuracwh0S3Bgi5RkXw/02d9RGhtcMPyHJj1753qT+MmPrg9Op7zfZ3TmU+xE6uAsBZYBF3jbEl1YYM+RMjkuQEVLvcB4ZaRclNHF/75Hm/LVxUzm2CFRMKUcIS+YFzJGvXH7auhCi4hS/7GKLkOf5AZriIM7DUAno78dpaxsoZwEStlQc3q3U0Ow663wnXnhxxAfXaRjsf2La63BEyAuES5KMX+rLpR4Q7XQNBHz/xRm5o+5yZeaUoAVBkSFsg81l+hDuUsi82TUp81Kz75k4+tV+wBEjQcNe/xk58WcbnC2J24UFBnvOSvLP5tm3MM+P3wg5yfnpJjxen3KYhJTnExgWT2G7+kDf9B+umzvVozIgKJdlM4FJQ4L+ssmsd/2pPuynmd87bNTLkXn5IKVoIofAiWvtXyCIqyuFaF1Ld/gHY ioFszVRV 10/Uz8NwxFXwz6Z7CwIUxnBb/mjt9Vs7MJGAlAunU6CndhAJuAFye+7Q5ha913YUo8sxgAMcWWkfIBOYAA/4iKMdCSkx9lwuYAIXurL/LLHAfE+VtN7h81YW+y5pMXBRTjdL8hXn/SZ7cwDMJkHb2BzCAqC1JEnZAJgDvKAjs24Jgzi67m6L8g2ZNc547e1MdGiLlWfxVR5ehEOay0748pFw50xclIrF9kt3ZLjM7r53RfkFkDN+mcTK+q9DCaLzt5lCTCNZaWa9tG1wPCcAAjJZttHyeN4UM7eVNiyy6a6+1XeE0GdSroBiIZGXXhREsCNaiTpe47cGeL8ug118DMzAAnWrclIxGwBt3VUh3/8QwaZJo/bH6spoQ2A== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Apr 22, 2025 at 3:49=E2=80=AFPM Andrii Nakryiko wrote: > > On Fri, Apr 18, 2025 at 10:50=E2=80=AFAM Suren Baghdasaryan wrote: > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > its important members being RCU-safe, /proc/pid/maps can be read under > > RCU and without the need to read-lock mmap_lock. However vma content > > can change from under us, therefore we make a copy of the vma and we > > pin pointer fields used when generating the output (currently only > > vm_file and anon_name). Afterwards we check for concurrent address > > space modifications, wait for them to end and retry. While we take > > the mmap_lock for reading during such contention, we do that momentaril= y > > only to record new mm_wr_seq counter. This change is designed to reduce > > This is probably a stupid question, but why do we need to take a lock > just to record this counter? uprobes get away without taking mmap_lock > even for reads, and still record this seq counter. And then detect > whether there were any modifications in between. Why does this change > need more heavy-weight mmap_read_lock to do speculative reads? Not a stupid question. mmap_read_lock() is used to wait for the writer to finish what it's doing and then we continue by recording a new sequence counter value and call mmap_read_unlock. This is what get_vma_snapshot() does. But your question made me realize that we can optimize m_start() further by not taking mmap_read_lock at all. Instead of taking mmap_read_lock then doing drop_mmap_lock() we can try mmap_lock_speculate_try_begin() and only if it fails do the same dance we do in the get_vma_snapshot(). I think that should work. > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > (often a low priority task, such as monitoring/data collection services= ) > > from blocking address space updates. > > Note that this change has a userspace visible disadvantage: it allows > > for sub-page data tearing as opposed to the previous mechanism where > > data tearing could happen only between pages of generated output data. > > Since current userspace considers data tearing between pages to be > > acceptable, we assume is will be able to handle sub-page data tearing > > as well. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > fs/proc/internal.h | 6 ++ > > fs/proc/task_mmu.c | 170 ++++++++++++++++++++++++++++++++++---- > > include/linux/mm_inline.h | 18 ++++ > > 3 files changed, 177 insertions(+), 17 deletions(-) > > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > > index 96122e91c645..6e1169c1f4df 100644 > > --- a/fs/proc/internal.h > > +++ b/fs/proc/internal.h > > @@ -379,6 +379,12 @@ struct proc_maps_private { > > struct task_struct *task; > > struct mm_struct *mm; > > struct vma_iterator iter; > > + bool mmap_locked; > > + loff_t last_pos; > > +#ifdef CONFIG_PER_VMA_LOCK > > + unsigned int mm_wr_seq; > > + struct vm_area_struct vma_copy; > > +#endif > > #ifdef CONFIG_NUMA > > struct mempolicy *task_mempolicy; > > #endif > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_m= aps_private *priv) > > } > > #endif > > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *p= riv, > > - loff_t *ppos) > > +#ifdef CONFIG_PER_VMA_LOCK > > + > > +static const struct seq_operations proc_pid_maps_op; > > + > > +/* > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > + * show_map_vma. > > + */ > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_= area_struct *vma) > > +{ > > + struct vm_area_struct *copy =3D &priv->vma_copy; > > + int ret =3D -EAGAIN; > > + > > + memcpy(copy, vma, sizeof(*vma)); > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > + goto out; > > + > > + if (!anon_vma_name_get_if_valid(copy)) > > + goto put_file; > > Given vm_file and anon_vma_name are both RCU-protected, if we take > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't > even need getting/putting them, no? Yeah, anon_vma_name indeed looks safe without pinning but vm_file is using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer but pointing to a wrong object even if the rcu grace period did not pass. With my assumption that seq_file can't rollback once show_map() is done, I would need a completely stable vma at the time show_map() is executed so that it does not change from under us while we are generating the output. OTOH, if we indeed can rollback while generating seq_file output then show_map() could output potentially invalid vma, then check for vma changes and when detected, rollback seq_file and retry again. > > I feel like I'm missing some important limitations, but I don't think > they are spelled out explicitly... > > > + > > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) > > + return 0; > > + > > + /* Address space got modified, vma might be stale. Re-lock and = retry. */ > > + rcu_read_unlock(); > > Another question I have is whether we really need to copy vma into > priv->vma_copy to have a stable snapshot? Can't we just speculate like > with uprobes under assumption that data doesn't change. And once we > are done producing text output, confirm that speculation was > successful, and if not - retry? > > We'd need an API for seq_file to rollback to previous good known > location for that, but that seems straightforward enough to do, no? > Just remember buffer position before speculation, write data, check > for no mm modifications, and if something changed, rollback seq file > to last position. >From looking at seq_read_iter() I think for a rollback we would have to reset seq_file.count and seq_file.index to their previous states. At this location: https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if show_map() returns negative value m->count will indeed be rolled back but not seq_file.index. Also returning negative value at https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230 would be interpreted as a hard error... So, I'll need to spend some time in this code to get the answer about rollback. Thanks for the review! > > > > + ret =3D mmap_read_lock_killable(priv->mm); > > + if (!ret) { > > + /* mmap_lock_speculate_try_begin() succeeds when holdin= g mmap_read_lock */ > > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_se= q); > > + mmap_read_unlock(priv->mm); > > + ret =3D -EAGAIN; > > + } > > + > > + rcu_read_lock(); > > + > > + anon_vma_name_put_if_valid(copy); > > +put_file: > > + if (copy->vm_file) > > + fput(copy->vm_file); > > +out: > > + return ret; > > +} > > + > > +static void put_vma_snapshot(struct proc_maps_private *priv) > > +{ > > + struct vm_area_struct *vma =3D &priv->vma_copy; > > + > > + anon_vma_name_put_if_valid(vma); > > + if (vma->vm_file) > > + fput(vma->vm_file); > > +} > > + > > [...]