From: Suren Baghdasaryan <surenb@google.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>, Al Viro <viro@zeniv.linux.org.uk>,
linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com,
david@redhat.com, vbabka@suse.cz, peterx@redhat.com,
hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org,
shuah@kernel.org, adobriyan@gmail.com, josef@toxicpanda.com,
yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org,
osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com,
christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
Date: Mon, 5 May 2025 09:38:27 -0700 [thread overview]
Message-ID: <CAJuCfpHYYGa4R11xJwZACNeXDxVroh-gUxHepdc2FfmgP_qBaA@mail.gmail.com> (raw)
In-Reply-To: <20250505-wachen-konform-3fe08f1b3214@brauner>
On Mon, May 5, 2025 at 6:50 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > Hi!
> > > >
> > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > > forgot all about that...)
> > >
> > > Does this fact affect your previous comments? Just want to make sure
> > > I'm not missing something...
> >
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
> >
> > I guess maybe that means you really do need to do the lookup from the
> > copied data, as you did in your patch; and that might require calling
> > get_file_active() on the copied ->vm_file pointer (instead of
> > get_file_rcu()), even though I think that is not really how
> > get_file_active() is supposed to be used (it's supposed to be used
> > when you know the original file hasn't been freed yet). Really what
>
> I think it's fine for get_file_active() to be used in this way. That
> ->vm_file pointer usage should get a fat comment above it explaining how
> what you're doing is safe.
Got it. Will use it in my next version. Thanks!
>
> > you'd want for that is basically a raw __get_file_rcu(), but that is
> > static and I think Christian wouldn't want to expose more of these
> > internals outside VFS...
>
> Yeah, no. I don't want that to be usable outside of that file.
>
> > (In that case, all the stuff below about get_file_rcu() would be moot.)
> >
> > Or you could pepper WRITE_ONCE() over all the places that write
> > ->vm_file, and ensure that ->vm_file is always NULLed before its
> > reference is dropped... but that seems a bit more ugly to me.
> >
> > > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
> > > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > > > from blocking address space updates.
> > > > > > [...]
> > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > > > --- a/fs/proc/task_mmu.c
> > > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > [...]
> > > > > > > +/*
> > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > > > + * show_map_vma.
> > > > > > > + */
> > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > + struct vm_area_struct *copy = &priv->vma_copy;
> > > > > > > + int ret = -EAGAIN;
> > > > > > > +
> > > > > > > + memcpy(copy, vma, sizeof(*vma));
> > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file))
> > > > > > > + goto out;
> > > > > >
> > > > > > I think this uses get_file_rcu() in a different way than intended.
> > > > > >
> > > > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > > > pointer which always points to a file with a non-zero refcount (except
> > > > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > > > it observes a zero refcount, it assumes that the pointer must have
> > > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > > > pointer that points to a file with zero refcount, which I think can
> > > > > > happen with this patch, will cause an endless loop.
> > > > > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > > > > still behave nicely and not spuriously return NULL when the file* is
> > > > > > concurrently updated to point to another file*; that's what that loop
> > > > > > is for.)
> > > > >
> > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > > > > checking the return value of get_file_rcu() and retrying speculation
> > > > > if it changed.
> > > >
> > > > I think you could probably still end up looping endlessly in get_file_rcu().
> >
> > (Just to be clear: What I meant here is that get_file_rcu() loops
> > *internally*; get_file_rcu() is not guaranteed to ever return if the
> > pointed-to file has a zero refcount.)
> >
> > > By "retrying speculation" I meant it in the sense of
> > > get_vma_snapshot() retry when it takes the mmap_read_lock and then
> > > does mmap_lock_speculate_try_begin to restart speculation. I'm also
> > > thinking about Liam's concern of guaranteeing forward progress for the
> > > reader. Thinking maybe I should not drop mmap_read_lock immediately on
> > > contention but generate some output (one vma or one page worth of
> > > vmas) before dropping mmap_read_lock and proceeding with speculation.
> >
> > Hm, yeah, I guess you need that for forward progress...
> >
> > > > > > (If my understanding is correct, maybe we should document that more
> > > > > > explicitly...)
> > > > >
> > > > > Good point. I'll add comments for get_file_rcu() as a separate patch.
> > > > >
> > > > > >
> > > > > > Also, I think you are introducing an implicit assumption that
> > > > > > remove_vma() does not NULL out the ->vm_file pointer (because that
> > > > > > could cause tearing and could theoretically lead to a torn pointer
> > > > > > being accessed here).
> > > > > >
> > > > > > One alternative might be to change the paths that drop references to
> > > > > > vma->vm_file (search for vma_close to find them) such that they first
> > > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > > > > > maybe with a new helper like this:
> > > > > >
> > > > > > static void vma_fput(struct vm_area_struct *vma)
> > > > > > {
> > > > > > struct file *file = vma->vm_file;
> > > > > >
> > > > > > if (file) {
> > > > > > WRITE_ONCE(vma->vm_file, NULL);
> > > > > > fput(file);
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > Then on the lockless lookup path you could use get_file_rcu() on the
> > > > > > ->vm_file pointer _of the original VMA_, and store the returned file*
> > > > > > into copy->vm_file.
> > > > >
> > > > > Ack. Except for storing the return value of get_file_rcu(). I think
> > > > > once we detect that get_file_rcu() returns a different file we should
> > > > > bail out and retry. The change in file is an indication that the vma
> > > > > got changed from under us, so whatever we have is stale.
> > > >
> > > > What does "different file" mean here - what file* would you compare
> > > > the returned one against?
> > >
> > > Inside get_vma_snapshot() I would pass the original vma->vm_file to
> > > get_file_rcu() and check if it returns the same value. If the value
> > > got changed we jump to /* Address space got modified, vma might be
> > > stale. Re-lock and retry. */ section. That should work, right?
> >
> > Where do you get an "original vma->vm_file" from?
> >
> > To be clear, get_file_rcu(p) returns one of the values that *p had
> > while get_file_rcu(p) is running.
next prev parent reply other threads:[~2025-05-05 16:38 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY " Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
2025-04-18 18:30 ` Lorenzo Stoakes
2025-04-18 19:31 ` Suren Baghdasaryan
2025-04-18 19:55 ` Lorenzo Stoakes
2025-04-18 20:03 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 5/8] selftests/proc: add verbose more for tests to facilitate debugging Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 6/8] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2025-04-22 22:48 ` Andrii Nakryiko
2025-04-23 21:49 ` Suren Baghdasaryan
2025-04-23 22:06 ` Andrii Nakryiko
2025-04-24 0:23 ` Liam R. Howlett
2025-04-24 15:20 ` Suren Baghdasaryan
2025-04-24 16:03 ` Andrii Nakryiko
2025-04-24 16:42 ` Liam R. Howlett
2025-04-24 17:38 ` Suren Baghdasaryan
2025-04-24 17:44 ` Andrii Nakryiko
2025-04-29 15:39 ` Jann Horn
2025-04-29 17:09 ` Suren Baghdasaryan
2025-04-29 17:20 ` Jann Horn
2025-04-29 18:04 ` Suren Baghdasaryan
2025-04-29 18:54 ` Jann Horn
2025-04-29 20:33 ` Suren Baghdasaryan
2025-04-29 20:43 ` Jann Horn
2025-05-05 13:50 ` Christian Brauner
2025-05-05 16:38 ` Suren Baghdasaryan [this message]
2025-04-18 17:49 ` [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl " Suren Baghdasaryan
2025-04-22 22:54 ` Andrii Nakryiko
2025-04-23 21:53 ` Suren Baghdasaryan
2025-04-29 15:55 ` Jann Horn
2025-04-29 17:14 ` Suren Baghdasaryan
2025-04-29 17:24 ` Jann Horn
2025-05-01 22:10 ` Andrii Nakryiko
2025-05-02 15:11 ` Jann Horn
2025-05-02 16:16 ` Suren Baghdasaryan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAJuCfpHYYGa4R11xJwZACNeXDxVroh-gUxHepdc2FfmgP_qBaA@mail.gmail.com \
--to=surenb@google.com \
--cc=Liam.Howlett@oracle.com \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=andrii@kernel.org \
--cc=brauner@kernel.org \
--cc=christophe.leroy@csgroup.eu \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=jannh@google.com \
--cc=josef@toxicpanda.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux@weissschuh.net \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhocko@kernel.org \
--cc=osalvador@suse.de \
--cc=paulmck@kernel.org \
--cc=peterx@redhat.com \
--cc=ryan.roberts@arm.com \
--cc=shuah@kernel.org \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yebin10@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox