Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com,
	 lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz,
	 peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org,
	mhocko@kernel.org,  paulmck@kernel.org, shuah@kernel.org,
	adobriyan@gmail.com, brauner@kernel.org,  josef@toxicpanda.com,
	yebin10@huawei.com, linux@weissschuh.net,  willy@infradead.org,
	osalvador@suse.de, andrii@kernel.org,  ryan.roberts@arm.com,
	christophe.leroy@csgroup.eu,  linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,  linux-mm@kvack.org,
	linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
Date: Wed, 23 Apr 2025 15:06:08 -0700	[thread overview]
Message-ID: <CAEf4BzYctDuS4DRTzdRQyyhCYvFTggOz=wcbizXEYvC_z_SSng@mail.gmail.com> (raw)
In-Reply-To: <CAJuCfpE_jJ0Xq5T0HcLpquRzO+NdvN3T3_JXEwSjt2NG9Ryy5g@mail.gmail.com>

On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record new mm_wr_seq counter. This change is designed to reduce
> >
> > This is probably a stupid question, but why do we need to take a lock
> > just to record this counter? uprobes get away without taking mmap_lock
> > even for reads, and still record this seq counter. And then detect
> > whether there were any modifications in between. Why does this change
> > need more heavy-weight mmap_read_lock to do speculative reads?
>
> Not a stupid question. mmap_read_lock() is used to wait for the writer
> to finish what it's doing and then we continue by recording a new
> sequence counter value and call mmap_read_unlock. This is what
> get_vma_snapshot() does. But your question made me realize that we can
> optimize m_start() further by not taking mmap_read_lock at all.
> Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> try mmap_lock_speculate_try_begin() and only if it fails do the same
> dance we do in the get_vma_snapshot(). I think that should work.

Ok, yeah, it would be great to avoid taking a lock in a common case!

>
> >
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > > Note that this change has a userspace visible disadvantage: it allows
> > > for sub-page data tearing as opposed to the previous mechanism where
> > > data tearing could happen only between pages of generated output data.
> > > Since current userspace considers data tearing between pages to be
> > > acceptable, we assume is will be able to handle sub-page data tearing
> > > as well.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  fs/proc/internal.h        |   6 ++
> > >  fs/proc/task_mmu.c        | 170 ++++++++++++++++++++++++++++++++++----
> > >  include/linux/mm_inline.h |  18 ++++
> > >  3 files changed, 177 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > index 96122e91c645..6e1169c1f4df 100644
> > > --- a/fs/proc/internal.h
> > > +++ b/fs/proc/internal.h
> > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > >         struct task_struct *task;
> > >         struct mm_struct *mm;
> > >         struct vma_iterator iter;
> > > +       bool mmap_locked;
> > > +       loff_t last_pos;
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +       unsigned int mm_wr_seq;
> > > +       struct vm_area_struct vma_copy;
> > > +#endif
> > >  #ifdef CONFIG_NUMA
> > >         struct mempolicy *task_mempolicy;
> > >  #endif
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > >  }
> > >  #endif
> > >
> > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > -                                               loff_t *ppos)
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static const struct seq_operations proc_pid_maps_op;
> > > +
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > +{
> > > +       struct vm_area_struct *copy = &priv->vma_copy;
> > > +       int ret = -EAGAIN;
> > > +
> > > +       memcpy(copy, vma, sizeof(*vma));
> > > +       if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > +               goto out;
> > > +
> > > +       if (!anon_vma_name_get_if_valid(copy))
> > > +               goto put_file;
> >
> > Given vm_file and anon_vma_name are both RCU-protected, if we take
> > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
> > even need getting/putting them, no?
>
> Yeah, anon_vma_name indeed looks safe without pinning but vm_file is
> using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer
> but pointing to a wrong object even if the rcu grace period did not
> pass. With my assumption that seq_file can't rollback once show_map()
> is done, I would need a completely stable vma at the time show_map()
> is executed so that it does not change from under us while we are
> generating the output.
> OTOH, if we indeed can rollback while generating seq_file output then
> show_map() could output potentially invalid vma, then check for vma
> changes and when detected, rollback seq_file and retry again.
>
> >
> > I feel like I'm missing some important limitations, but I don't think
> > they are spelled out explicitly...
> >
> > > +
> > > +       if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > > +               return 0;
> > > +
> > > +       /* Address space got modified, vma might be stale. Re-lock and retry. */
> > > +       rcu_read_unlock();
> >
> > Another question I have is whether we really need to copy vma into
> > priv->vma_copy to have a stable snapshot? Can't we just speculate like
> > with uprobes under assumption that data doesn't change. And once we
> > are done producing text output, confirm that speculation was
> > successful, and if not - retry?
> >
> > We'd need an API for seq_file to rollback to previous good known
> > location for that, but that seems straightforward enough to do, no?
> > Just remember buffer position before speculation, write data, check
> > for no mm modifications, and if something changed, rollback seq file
> > to last position.
>
> From looking at seq_read_iter() I think for a rollback we would have
> to reset seq_file.count and seq_file.index to their previous states.
> At this location:
> https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if
> show_map() returns negative value m->count will indeed be rolled back
> but not seq_file.index. Also returning negative value at
> https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230
> would be interpreted as a hard error... So, I'll need to spend some
> time in this code to get the answer about rollback.
> Thanks for the review!

Yeah, seq_file is a glorified wrapper around a memory buffer,
essentially. And at the lowest level, this transaction-like API would
basically just return seq->count before we start writing anything, and
rollback will just accept a new count to set to seq->count, if we need
to rollback.

Logistically this all needs to be factored into the whole seq_file
callbacks thing, of course, especially if "transaction" will be
started in m_start/m_next, while it can be "aborted" in m_show... So
that's what would need careful consideration.

But you can end up with faster and cleaner implementation, as we
discussed above, so worth giving it a try, IMO.

>
> >
> >
> > > +       ret = mmap_read_lock_killable(priv->mm);
> > > +       if (!ret) {
> > > +               /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> > > +               mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > > +               mmap_read_unlock(priv->mm);
> > > +               ret = -EAGAIN;
> > > +       }
> > > +
> > > +       rcu_read_lock();
> > > +
> > > +       anon_vma_name_put_if_valid(copy);
> > > +put_file:
> > > +       if (copy->vm_file)
> > > +               fput(copy->vm_file);
> > > +out:
> > > +       return ret;
> > > +}
> > > +
> > > +static void put_vma_snapshot(struct proc_maps_private *priv)
> > > +{
> > > +       struct vm_area_struct *vma = &priv->vma_copy;
> > > +
> > > +       anon_vma_name_put_if_valid(vma);
> > > +       if (vma->vm_file)
> > > +               fput(vma->vm_file);
> > > +}
> > > +
> >
> > [...]

next prev parent reply	other threads:[~2025-04-23 22:06 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY " Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
2025-04-18 18:30   ` Lorenzo Stoakes
2025-04-18 19:31     ` Suren Baghdasaryan
2025-04-18 19:55       ` Lorenzo Stoakes
2025-04-18 20:03         ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 5/8] selftests/proc: add verbose more for tests to facilitate debugging Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 6/8] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2025-04-22 22:48   ` Andrii Nakryiko
2025-04-23 21:49     ` Suren Baghdasaryan
2025-04-23 22:06       ` Andrii Nakryiko [this message]
2025-04-24  0:23         ` Liam R. Howlett
2025-04-24 15:20           ` Suren Baghdasaryan
2025-04-24 16:03             ` Andrii Nakryiko
2025-04-24 16:42               ` Liam R. Howlett
2025-04-24 17:38                 ` Suren Baghdasaryan
2025-04-24 17:44                 ` Andrii Nakryiko
2025-04-29 15:39   ` Jann Horn
2025-04-29 17:09     ` Suren Baghdasaryan
2025-04-29 17:20       ` Jann Horn
2025-04-29 18:04         ` Suren Baghdasaryan
2025-04-29 18:54           ` Jann Horn
2025-04-29 20:33             ` Suren Baghdasaryan
2025-04-29 20:43               ` Jann Horn
2025-05-05 13:50             ` Christian Brauner
2025-05-05 16:38               ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl " Suren Baghdasaryan
2025-04-22 22:54   ` Andrii Nakryiko
2025-04-23 21:53     ` Suren Baghdasaryan
2025-04-29 15:55     ` Jann Horn
2025-04-29 17:14       ` Suren Baghdasaryan
2025-04-29 17:24         ` Jann Horn
2025-05-01 22:10           ` Andrii Nakryiko
2025-05-02 15:11             ` Jann Horn
2025-05-02 16:16               ` Suren Baghdasaryan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAEf4BzYctDuS4DRTzdRQyyhCYvFTggOz=wcbizXEYvC_z_SSng@mail.gmail.com' \
    --to=andrii.nakryiko@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=adobriyan@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrii@kernel.org \
    --cc=brauner@kernel.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=jannh@google.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux@weissschuh.net \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@kernel.org \
    --cc=osalvador@suse.de \
    --cc=paulmck@kernel.org \
    --cc=peterx@redhat.com \
    --cc=ryan.roberts@arm.com \
    --cc=shuah@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=yebin10@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox