From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7D5E6C48260 for ; Thu, 25 Jan 2024 21:31:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 10DF68D0006; Thu, 25 Jan 2024 16:31:56 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 096D68D0001; Thu, 25 Jan 2024 16:31:56 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EA09C8D0006; Thu, 25 Jan 2024 16:31:55 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id DC2CA8D0001 for ; Thu, 25 Jan 2024 16:31:55 -0500 (EST) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id B72CE1402F4 for ; Thu, 25 Jan 2024 21:31:55 +0000 (UTC) X-FDA: 81719130990.17.E3DA819 Received: from mail-yb1-f171.google.com (mail-yb1-f171.google.com [209.85.219.171]) by imf13.hostedemail.com (Postfix) with ESMTP id EBF1420020 for ; Thu, 25 Jan 2024 21:31:53 +0000 (UTC) Authentication-Results: imf13.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=AIiCY1fW; spf=pass (imf13.hostedemail.com: domain of surenb@google.com designates 209.85.219.171 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706218314; a=rsa-sha256; cv=none; b=B4EK4k+Pv1QcQq+ht0I+18c/gR6ruI5v/j8fPun2eiIhXdxIQeL/xYoS5n8KrGSbR74HTe gud2IUHiV7MlNNp0syJShHwkculvdb7UiQlve0nvT0POZNmzk3OZPY8n6n/07caZ41ppaZ mILNCmECTphJhpKoV1wgHsQfuC8Sfjk= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=AIiCY1fW; spf=pass (imf13.hostedemail.com: domain of surenb@google.com designates 209.85.219.171 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706218314; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Zla3H72Ilwvo/eWpxYhoN+6CJQRd1jb7Ae4lq+rRb1c=; b=QpiypSxrqgZQGwyCf1EB5ycyt3mNMOUNGy1WN/U+m+Fn/GtxT033V9g631DVk5jOc2A0wi 01LW6YNjUiFgcBauL0r+HIgLNJESl+GBZfxafWdXKLyKN+pwN1ipeXn8YbJbmsGyQdWtOf L071n4XiVG98OgW/ytjF5oVPhSskSeg= Received: by mail-yb1-f171.google.com with SMTP id 3f1490d57ef6-dc24ead4428so5125635276.1 for ; Thu, 25 Jan 2024 13:31:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1706218313; x=1706823113; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Zla3H72Ilwvo/eWpxYhoN+6CJQRd1jb7Ae4lq+rRb1c=; b=AIiCY1fWgfKtuNn+FxkNKdBw9L71QJ2OOa97GzTDSo6Yk4Hu7gdxXIajXoylMTENaZ AnDJDUFYJhGGyRxxnzLGiWy7TrbW9KcO43KHsMJHL+/FC3TjdVbWg+AI1Qef1AQRjJ3t ajQxQ+YZHxkJT5xzGa8vCDucrGMLbjXTHFL78z+YqDWDVu0KA8YQ20JptEozg7g6me+8 LpQC95L9ww8JvlhGyp1F8yXxBGSvNhkXzN4L+ygsEVPqhA2lqEo2aPhW9MmoaOl12iHS yG56L9SpSp5/o1jmYuvOwkFAIMIiXZQI9r4t4IMNQFUvt3O2ujd0EsHjU63QN3Alpbpn thAQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706218313; x=1706823113; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Zla3H72Ilwvo/eWpxYhoN+6CJQRd1jb7Ae4lq+rRb1c=; b=WXxnLa62BZDOizSBFF/TQrT2CYcRkiaZfeXsTLf74jCRlraUZkuAAPEWUIJ6BBJvuM 3/Ep6PSrILjEAxGWV7gkxWSxWaxYVDR6i68HY5s6EmNoM2ehoJZp5xXNp1HifhzMsN+x e0GIF1P0JAXcGWkWZFIw/0U6z7xJzaKJfoeAe9s7Yx/kVVJ/l+V5cMjvOKhrOjLAkZO5 1OLaNZfJ1ijgzAoSHQndFlQzj2CiJpBX6S2VlLSXEKa/RtC8VmRcbaV+jwhgv9g1x6S+ I+vmuFXWpzSRkubmLFUK0T9y9hkN7zFNim/DuFd00RkQADJiTQYPw5BnibKK8IXNFhRU y6Ww== X-Gm-Message-State: AOJu0YyzMQD5XeykJec2i9DsIJqKvoqbyTV/YkjyyR/7FKna+Ki24Xy+ +hO6bwTIifzqyBo6bEomc03wLreAxJCbdfnZ8fscYOqEn6L4sIxvkqEAD3Mk9WjElmZL3Vbd6Xq FIawhA0jII1PR1WpfdIxYKxWPjIpKQSjEEeZo X-Google-Smtp-Source: AGHT+IGXm6ZV7XipOpIRJJkQ3M5Lk73qTFNsJKi41d/hkr6pxGn4JRC7wlmQvXAm/UpfGcKzyqbwuqpqleP8yOj0aPs= X-Received: by 2002:a25:2b43:0:b0:dbe:a31e:712a with SMTP id r64-20020a252b43000000b00dbea31e712amr441292ybr.109.1706218312759; Thu, 25 Jan 2024 13:31:52 -0800 (PST) MIME-Version: 1.0 References: <20240123231014.3801041-1-surenb@google.com> <20240123231014.3801041-3-surenb@google.com> In-Reply-To: <20240123231014.3801041-3-surenb@google.com> From: Suren Baghdasaryan Date: Thu, 25 Jan 2024 13:31:41 -0800 Message-ID: Subject: Re: [PATCH v2 3/3] mm/maps: read proc/pid/maps under RCU To: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, dchinner@redhat.com, casey@schaufler-ca.com, ben.wolsieffer@hefring.com, paulmck@kernel.org, david@redhat.com, avagin@google.com, usama.anjum@collabora.com, peterx@redhat.com, hughd@google.com, ryan.roberts@arm.com, wangkefeng.wang@huawei.com, Liam.Howlett@oracle.com, yuzhao@google.com, axelrasmussen@google.com, lstoakes@gmail.com, talumbau@google.com, willy@infradead.org, vbabka@suse.cz, mgorman@techsingularity.net, jhubbard@nvidia.com, vishal.moola@gmail.com, mathieu.desnoyers@efficios.com, dhowells@redhat.com, jgg@ziepe.ca, sidhartha.kumar@oracle.com, andriy.shevchenko@linux.intel.com, yangxingui@huawei.com, keescook@chromium.org, sj@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: EBF1420020 X-Stat-Signature: 9sy7afeymq5tuc8uxgxke1gk7qb73n1u X-Rspam-User: X-HE-Tag: 1706218313-374519 X-HE-Meta: U2FsdGVkX1+vPwe7hbGSGD/mOZVSIcerDDk8TiWFKUtoUriGtIygAG73PlJ59Y8pxE7FRUNSxm6BLrF3D9k0Ais52PGGdwq1jHeNYge6OnlumFIJmw7E52yiVuezTEwFz4JOmgUE/k5xLCod53a8HNNhvwq8r2avQNedNVtladyDPQW6Gmq4kg+9P82tenro0bDJLe6N6le3RChaobrm9U9oAm2d6oKbL2C4Z5y0F6fhC2bAyJ5dHBCFxQ1HqK3faSE+Or7hjM3yUpZEIKU1VT/KF1rimN0oclDny3Bgx45wbKNaUfISEoV5WVob8nKkhF40A1mEmpK2bPQoEGWZ1y3ULTTmi/gtrkOUq2hXjfDullYSRE9gOWcxuIewOuRFmI1E5dOl1KjBVplUtzqsLaRwjJKgA34/K7ix3iD+Cv+I7wChJVTL6qtFBY1ZxGEtHVBYTm3vX0OKC0sB88j2dVJvudBMks2wtnOlJSuKQYjhWVuFAL0+z9wPG5kZ/eJ5zHkgqYq0AzNc0XC7q2xUwxf+t881H4u9Lpk/I3xwCw0Xs3OfYmdH2otLp9DiOLQ12woCaUfGYuguMGkTuJjdYwXht8itD1fmxacySJAlwaKBj8bs66A4Op9iO8+zLHNDEjW2FYMkxRGXQj7+zmZAQWfiwW1Zs5GgZTPz583IHHxp1SVfgDMe0FL6DXpW6rodYORqtZXrMumQcpaP1OwA24SvJ1J1a5lKwfYgibXLJwhg+BxXJ68gazFpTVyYVpzppqYDjeynmdRpCsHDA+7NOmo3Lsk2poP2zu55Ne0fD2d5YNDRTcyvx5Wyqb96BuRhjKSNFJu558R3SkZN3ROKrq/xF2FC/D2sb/FSgjbu/R74H4Sz8qlfpcax++W0P9KFJPEOt4RGI4fKkcA0LzzmolwMmWIHS7GN0KfZ5BSMWswXZ4o4WaDo4IbhBJA1moRg9Q6xcWRYOTeh+9tBDLm rPOrhZS7 DOEvPHwwZi67eeOCuaHD7RLrsozx5rFvvNdbJ1v6zXO3E6sWjPD3/DS/GQBlDr9e7VaOzykGjzUK5DcVnrroLj0G9JSuzNgyYw3iaXMmCrFWfkGk7YpNcG5FuDJ63frkRmpHzEA1mpMyeF4Vcc/gpyIa1dykkyRs8U9wMSsgESLRlhe01YG3w59/doXSm7i/dRZx26A6/F5/vMrjzs0f2G0/tvAI/OuOdrDCKnAG02gN4EBYMrG1gRgDoz2Djr3NThwGM+wmTwi4y7d8CKPcE9pJZVxbcfUthjo05td6EDHq+nFagrM8LoIqgichmp2V14romzrlKkgrFE6zpO7HCup4/C0aDz7U7LWpqGI6YOtqgMT8MvY4WFU7DLYyM/CpJcNyi X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jan 23, 2024 at 3:10=E2=80=AFPM Suren Baghdasaryan wrote: > > With maple_tree supporting vma tree traversal under RCU and per-vma locks > making vma access RCU-safe, /proc/pid/maps can be read under RCU and > without the need to read-lock mmap_lock. However vma content can change > from under us, therefore we make a copy of the vma and we pin pointer > fields used when generating the output (currently only vm_file and > anon_name). Afterwards we check for concurrent address space > modifications, wait for them to end and retry. That last check is needed > to avoid possibility of missing a vma during concurrent maple_tree > node replacement, which might report a NULL when a vma is replaced > with another one. While we take the mmap_lock for reading during such > contention, we do that momentarily only to record new mm_wr_seq counter. > This change is designed to reduce mmap_lock contention and prevent a > process reading /proc/pid/maps files (often a low priority task, such as > monitoring/data collection services) from blocking address space updates. > > Note that this change has a userspace visible disadvantage: it allows for > sub-page data tearing as opposed to the previous mechanism where data > tearing could happen only between pages of generated output data. > Since current userspace considers data tearing between pages to be > acceptable, we assume is will be able to handle sub-page data tearing > as well. > > Signed-off-by: Suren Baghdasaryan > --- > Changes since v1 [1]: > - Fixed CONFIG_ANON_VMA_NAME=3Dn build by introducing > anon_vma_name_{get|put}_if_valid, per SeongJae Park > - Fixed misspelling of get_vma_snapshot() > > [1] https://lore.kernel.org/all/20240122071324.2099712-3-surenb@google.co= m/ > > fs/proc/internal.h | 2 + > fs/proc/task_mmu.c | 113 +++++++++++++++++++++++++++++++++++--- > include/linux/mm_inline.h | 18 ++++++ > 3 files changed, 126 insertions(+), 7 deletions(-) > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > index a71ac5379584..e0247225bb68 100644 > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -290,6 +290,8 @@ struct proc_maps_private { > struct task_struct *task; > struct mm_struct *mm; > struct vma_iterator iter; > + unsigned long mm_wr_seq; > + struct vm_area_struct vma_copy; > #ifdef CONFIG_NUMA > struct mempolicy *task_mempolicy; > #endif > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 3f78ebbb795f..0d5a515156ee 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -126,11 +126,95 @@ static void release_task_mempolicy(struct proc_maps= _private *priv) > } > #endif > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *pri= v, > - loff_t *ppos) > +#ifdef CONFIG_PER_VMA_LOCK > + > +static const struct seq_operations proc_pid_maps_op; > + > +/* > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > + * show_map_vma. > + */ > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_ar= ea_struct *vma) > +{ > + struct vm_area_struct *copy =3D &priv->vma_copy; > + int ret =3D -EAGAIN; > + > + memcpy(copy, vma, sizeof(*vma)); > + if (copy->vm_file && !get_file_rcu(©->vm_file)) There is a problem in this patchset. I assumed that get_file_rcu() can be used against vma->vm_file but that's not true. vma->vm_file is freed via a call to fput() which schedules its freeing using schedule_delayed_work(..., 1) but I don't think that constitutes RCU grace period, so it can be freed from under us. Andrew, could you please remove this patchset from your tree until I sort this out? Thanks, Suren. > + goto out; > + > + if (!anon_vma_name_get_if_valid(copy)) > + goto put_file; > + > + if (priv->mm_wr_seq =3D=3D mmap_write_seq_read(priv->mm)) > + return 0; > + > + /* Address space got modified, vma might be stale. Wait and retry= . */ > + rcu_read_unlock(); > + ret =3D mmap_read_lock_killable(priv->mm); > + mmap_write_seq_record(priv->mm, &priv->mm_wr_seq); > + mmap_read_unlock(priv->mm); > + rcu_read_lock(); > + > + if (!ret) > + ret =3D -EAGAIN; /* no other errors, ok to retry */ > + > + anon_vma_name_put_if_valid(copy); > +put_file: > + if (copy->vm_file) > + fput(copy->vm_file); > +out: > + return ret; > +} > + > +static void put_vma_snapshot(struct proc_maps_private *priv) > +{ > + struct vm_area_struct *vma =3D &priv->vma_copy; > + > + anon_vma_name_put_if_valid(vma); > + if (vma->vm_file) > + fput(vma->vm_file); > +} > + > +static inline bool needs_mmap_lock(struct seq_file *m) > +{ > + /* > + * smaps and numa_maps perform page table walk, therefore require > + * mmap_lock but maps can be read under RCU. > + */ > + return m->op !=3D &proc_pid_maps_op; > +} > + > +#else /* CONFIG_PER_VMA_LOCK */ > + > +/* Without per-vma locks VMA access is not RCU-safe */ > +static inline bool needs_mmap_lock(struct seq_file *m) { return true; } > + > +#endif /* CONFIG_PER_VMA_LOCK */ > + > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *p= pos) > { > + struct proc_maps_private *priv =3D m->private; > struct vm_area_struct *vma =3D vma_next(&priv->iter); > > +#ifdef CONFIG_PER_VMA_LOCK > + if (vma && !needs_mmap_lock(m)) { > + int ret; > + > + put_vma_snapshot(priv); > + while ((ret =3D get_vma_snapshot(priv, vma)) =3D=3D -EAGA= IN) { > + /* lookup the vma at the last position again */ > + vma_iter_init(&priv->iter, priv->mm, *ppos); > + vma =3D vma_next(&priv->iter); > + } > + > + if (ret) { > + put_vma_snapshot(priv); > + return NULL; > + } > + vma =3D &priv->vma_copy; > + } > +#endif > if (vma) { > *ppos =3D vma->vm_start; > } else { > @@ -169,12 +253,20 @@ static void *m_start(struct seq_file *m, loff_t *pp= os) > return ERR_PTR(-EINTR); > } > > + /* Drop mmap_lock if possible */ > + if (!needs_mmap_lock(m)) { > + mmap_write_seq_record(priv->mm, &priv->mm_wr_seq); > + mmap_read_unlock(priv->mm); > + rcu_read_lock(); > + memset(&priv->vma_copy, 0, sizeof(priv->vma_copy)); > + } > + > vma_iter_init(&priv->iter, mm, last_addr); > hold_task_mempolicy(priv); > if (last_addr =3D=3D -2UL) > return get_gate_vma(mm); > > - return proc_get_vma(priv, ppos); > + return proc_get_vma(m, ppos); > } > > static void *m_next(struct seq_file *m, void *v, loff_t *ppos) > @@ -183,7 +275,7 @@ static void *m_next(struct seq_file *m, void *v, loff= _t *ppos) > *ppos =3D -1UL; > return NULL; > } > - return proc_get_vma(m->private, ppos); > + return proc_get_vma(m, ppos); > } > > static void m_stop(struct seq_file *m, void *v) > @@ -195,7 +287,10 @@ static void m_stop(struct seq_file *m, void *v) > return; > > release_task_mempolicy(priv); > - mmap_read_unlock(mm); > + if (needs_mmap_lock(m)) > + mmap_read_unlock(mm); > + else > + rcu_read_unlock(); > mmput(mm); > put_task_struct(priv->task); > priv->task =3D NULL; > @@ -283,8 +378,10 @@ show_map_vma(struct seq_file *m, struct vm_area_stru= ct *vma) > start =3D vma->vm_start; > end =3D vma->vm_end; > show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino); > - if (mm) > - anon_name =3D anon_vma_name(vma); > + if (mm) { > + anon_name =3D needs_mmap_lock(m) ? anon_vma_name(vma) : > + anon_vma_name_get_rcu(vma); > + } > > /* > * Print the dentry name for named mappings, and a > @@ -338,6 +435,8 @@ show_map_vma(struct seq_file *m, struct vm_area_struc= t *vma) > seq_puts(m, name); > } > seq_putc(m, '\n'); > + if (anon_name && !needs_mmap_lock(m)) > + anon_vma_name_put(anon_name); > } > > static int show_map(struct seq_file *m, void *v) > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h > index bbdb0ca857f1..a4a644fe005e 100644 > --- a/include/linux/mm_inline.h > +++ b/include/linux/mm_inline.h > @@ -413,6 +413,21 @@ static inline bool anon_vma_name_eq(struct anon_vma_= name *anon_name1, > > struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma); > > +/* > + * Takes a reference if anon_vma is valid and stable (has references). > + * Fails only if anon_vma is valid but we failed to get a reference. > + */ > +static inline bool anon_vma_name_get_if_valid(struct vm_area_struct *vma= ) > +{ > + return !vma->anon_name || anon_vma_name_get_rcu(vma); > +} > + > +static inline void anon_vma_name_put_if_valid(struct vm_area_struct *vma= ) > +{ > + if (vma->anon_name) > + anon_vma_name_put(vma->anon_name); > +} > + > #else /* CONFIG_ANON_VMA_NAME */ > static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {} > static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {} > @@ -432,6 +447,9 @@ struct anon_vma_name *anon_vma_name_get_rcu(struct vm= _area_struct *vma) > return NULL; > } > > +static inline bool anon_vma_name_get_if_valid(struct vm_area_struct *vma= ) { return true; } > +static inline void anon_vma_name_put_if_valid(struct vm_area_struct *vma= ) {} > + > #endif /* CONFIG_ANON_VMA_NAME */ > > static inline void init_tlb_flush_pending(struct mm_struct *mm) > -- > 2.43.0.429.g432eaa2c6b-goog >