From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Sat, 7 Jun 2025 18:41:35 -0700
Subject: Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
To: Lorenzo Stoakes
Cc: akpm@linux-foundation.org, Liam.Howlett@oracle.com, david@redhat.com,
 vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org,
 mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com,
 brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com,
 linux@weissschuh.net, willy@infradead.org, osalvador@suse.de,
 andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu,
 tjmercier@google.com, kaleshsingh@google.com, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org
In-Reply-To: <4cefe058-d67b-491c-83ac-293544385e84@lucifer.local>
References: <20250604231151.799834-1-surenb@google.com>
 <20250604231151.799834-7-surenb@google.com>
 <4cefe058-d67b-491c-83ac-293544385e84@lucifer.local>
Content-Type: text/plain; charset="UTF-8"
On Sat, Jun 7, 2025 at 10:43 AM Lorenzo Stoakes wrote:
>
> Hi Suren,
>
> Forgive me but I am going to ask a lot of questions here :p just want to
> make sure I'm getting everything right here.

No worries and thank you for reviewing!

> On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> > With maple_tree supporting vma tree traversal under RCU and per-vma
> > locks, /proc/pid/maps can be read while holding individual vma locks
> > instead of locking the entire address space.
>
> Nice :)
>
> > A completely lockless approach would be quite complex, with the main issue
> > being get_vma_name() using callbacks which might not work correctly with
> > a stable vma copy, requiring the original (unstable) vma.
>
> Hmmm can you expand on what a 'completely lockless' design might comprise of?

In my previous implementation
(https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/)
I was doing this under RCU while checking the mmap_lock seq counter to
detect address space changes. That's what I meant by a completely
lockless approach here.

> It's super un-greppable and I've not got clangd set up with an allmod kernel to
> triple-check but I'm seeing at least 2 (are there more?):
>
> gate_vma_name() which is:
>
>         return "[vsyscall]";
>
> special_mapping_name() which is:
>
>         return ((struct vm_special_mapping *)vma->vm_private_data)->name;
>
> Which I'm guessing is the issue because it's a double pointer deref...

Correct, but in more general terms: depending on the implementation of
the vm_ops.name callback, vma->vm_ops->name(vma) might not work
correctly with a vma copy. special_mapping_name() is an example of that.

> Seems such a silly issue to get stuck on, I wonder if we can't just change
> this to function correctly?

I was thinking about different ways to overcome that, but once I
realized that per-vma locks result in even less contention and the
implementation is simpler and more robust, I decided that the per-vma
locks direction is better.
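
To make the failure mode concrete, here is the shape of the problem as I
see it (my illustration only, not code from the series; the function name
is made up). A reader working on a flat snapshot of the vma keeps the
pointer values but loses any guarantee about what they point to:

static const char *name_from_snapshot(struct vm_area_struct *vma)
{
        struct vm_area_struct copy = *vma;      /* flat copy taken under RCU */

        /*
         * special_mapping_name() is:
         *      return ((struct vm_special_mapping *)vma->vm_private_data)->name;
         *
         * The copy preserves the vm_private_data pointer itself, but
         * nothing keeps the object it points to stable while we
         * dereference it, so the callback is only safe on the original,
         * locked vma.
         */
        return copy.vm_ops->name(&copy);        /* unsafe double deref */
}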
> > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > lock the vma, release the mmap_lock and continue. This guarantees the
> > reader will make forward progress even during lock contention. This will
>
> Ah that fabled constant forward progress ;)
>
> > interfere with the writer, but only for a very short time while we are
> > acquiring the per-vma lock and only when there was contention on the
> > vma the reader is interested in.
> > One case requiring special handling is when a vma changes between the
> > time it was found and the time it got locked. A problematic case would
> > be if the vma got shrunk so that its start moved higher in the address
> > space and a new vma was installed at the beginning:
> >
> > reader found:              |--------VMA A--------|
> > VMA is modified:           |-VMA B-|----VMA A----|
> > reader locks modified VMA A
> > reader reports VMA A:      |  gap  |----VMA A----|
> >
> > This would result in reporting a gap in the address space that does not
> > exist. To prevent this we retry the lookup after locking the vma; however,
> > we do that only when we identify a gap and detect that the address space
> > was changed after we found the vma.
>
> OK so in this case we have:
>
> 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
>    are... safe? From unmaps? Or are we? I guess actually the detach
>    mechanism sorts this out for us perhaps?

Yes, VMAs are RCU-safe and we do detect if one got detached after we
found it but before we locked it.

> 2. We got unlucky and did this immediately prior to VMA A having its
>    vma->vm_start, vm_end updated to reflect the split.

Yes, the split happened after we found it and before we locked it.

> 3. We lock VMA A, now positioned with an apparent gap after the prior VMA
>    which, in practice, does not exist.

Correct.

> So I am guessing that by observing sequence numbers you are able to detect
> that a change has occurred and thus retry the operation in this situation?

Yes, we detect the gap and we detect that the address space has
changed, so to ensure we did not miss a split we fall back to
mmap_read_lock, lock the VMA while holding mmap_read_lock, drop
mmap_read_lock and retry.

> I know we previously discussed the possibility of this retry mechanism
> going on forever, I guess I will see the resolution to this in the code :)

Retry in this case won't go on forever because we take mmap_read_lock
during the retry. In the worst case we would be constantly falling back
to mmap_read_lock, but that's a very unlikely scenario (the writer would
have to be constantly splitting the vma right before the reader locks
it).

> > This change is designed to reduce mmap_lock contention and prevent a
> > process reading /proc/pid/maps files (often a low priority task, such
> > as monitoring/data collection services) from blocking address space
> > updates. Note that this change has a userspace visible disadvantage:
> > it allows for sub-page data tearing as opposed to the previous mechanism
> > where data tearing could happen only between pages of generated output
> > data. Since current userspace considers data tearing between pages to be
> > acceptable, we assume it will be able to handle sub-page data tearing
> > as well.
>
> By tearing do you mean for instance seeing a VMA more than once due to
> e.g. a VMA expanding in a racey way?

Yes.
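
For context, a typical consumer of this file looks something like the
following (plain userspace, my sketch, not part of the patch). With
per-vma locking, two adjacent lines of its output may now come from
different snapshots of the address space; that is the sub-page tearing
the changelog mentions:

#include <stdio.h>

int main(void)
{
        char line[1024];
        FILE *f = fopen("/proc/self/maps", "r");

        if (!f)
                return 1;
        /*
         * Each line describes one vma; a vma that is concurrently
         * split, merged or remapped may appear twice or with bounds
         * from two different points in time.
         */
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}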
> Pedantic I know, but it might be worth going through all the merge,
> split and remap scenarios and explaining what might happen in each one (or
> perhaps do that as some form of documentation?)
>
> I can try to put together a list of all of the possibilities if that would
> be helpful.

Hmm. That might be an interesting exercise. I called out this
particular case because my tests caught it. I spent some time thinking
about other possible scenarios where we would report a gap in a place
where there are no gaps, but could not think of anything else.

> > Signed-off-by: Suren Baghdasaryan
> > ---
> >  fs/proc/internal.h |   6 ++
> >  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 175 insertions(+), 8 deletions(-)
>
> I really hate having all this logic in the proc/task_mmu.c file.
>
> This is really delicate stuff and I'd really like it to live in mm if
> possible.
>
> I realise this might be a total pain, but I'm quite worried about us
> putting super-delicate, carefully written VMA handling code in different
> places.
>
> Also having stuff in mm/vma.c opens the door to userland testing which,
> when I finally have time to really expand that, would allow for some really
> nice stress testing here.

That would require some sizable refactoring. I assume code for smaps
reading and PROCMAP_QUERY would have to be moved as well?

> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 96122e91c645..3728c9012687 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -379,6 +379,12 @@ struct proc_maps_private {
> >       struct task_struct *task;
> >       struct mm_struct *mm;
> >       struct vma_iterator iter;
> > +     loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     bool mmap_locked;
> > +     unsigned int mm_wr_seq;
>
> Is this the _last_ sequence number observed in the mm_struct? or rather,
> previous? Nitty but maybe worth renaming accordingly.

It's a copy of the mm->mm_wr_seq. I can add a comment if needed.

> > +     struct vm_area_struct *locked_vma;
> > +#endif
> >  #ifdef CONFIG_NUMA
> >       struct mempolicy *task_mempolicy;
> >  #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 27972c0749e7..36d883c4f394 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> >  }
> >  #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > -                                         loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> > +                                       struct vm_area_struct *vma,
> > +                                       unsigned long last_pos,
> > +                                       bool mm_unstable)
>
> This whole function is a bit weird tbh, you handle both the
> mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
> lock at all...

Why do you think so? vma_start_read() is always called, but in the
mm_unstable=true case we double-check for gaps to take care of the case
I mentioned in the changelog.

> Nitty (sorry I know this is mildly irritating review) but maybe needs to be
> renamed, or split up somehow?
>
> This is only trylocking in the mm_unstable case...

Nope, I think you misunderstood the intention, as I mentioned above.

> > +{
> > +     vma = vma_start_read(priv->mm, vma);
>
> Do we want to do this with mm_unstable == false?

Yes, always. mm_unstable=false only indicates that we are already
holding mmap_read_lock, so we don't need to double-check for gaps.
Perhaps I should add some comments to clarify what purpose this
parameter serves...
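
Something along these lines, perhaps (just a draft of the comment; the
final wording and naming are still open):

/*
 * Try to read-lock the vma found at last_pos.
 *
 * mm_unstable == true:  we are called under RCU only. The address space
 * can change at any moment, so after locking the vma we must re-verify
 * that it still starts where we expect it to; otherwise we could report
 * a gap that does not actually exist.
 *
 * mm_unstable == false: we are called with mmap_read_lock held (the
 * retry path). The address space cannot change under us, so no
 * re-validation is needed.
 */
static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
                                          struct vm_area_struct *vma,
                                          unsigned long last_pos,
                                          bool mm_unstable);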
> I know (from my own documentation :)) taking a VMA read lock while holding
> an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?

Ah, right. I should use vma_start_read_locked() instead when we are
holding mmap_read_lock. That's why that function was introduced. Will
change.

> > +     if (IS_ERR_OR_NULL(vma))
> > +             return NULL;
>
> Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
> ago from people moaning at me on code review :)
>
> Sorry I know that's annoying but perhaps it's indicative of an issue in the
> interface? That's possibly out of scope here however.

vma_start_read() returns NULL or EAGAIN to signal its caller (normally
lock_vma_under_rcu()) that it should retry the VMA lookup. Here we
retry under mmap_read_lock in either case; that's why EAGAIN is
ignored.

> Why are we ignoring errors here though? I guess because we don't care if
> the VMA got detached from under us, we don't bother retrying like we do in
> lock_vma_under_rcu()?

No, we take mmap_read_lock and retry in either case. Perhaps I should
split trylock_vma() into two separate functions - one for the case when
we are holding mmap_read_lock and another for when we don't. I think
that would have prevented many of your questions. I'll try that and see
how it looks.

> Should we just abstract that part of lock_vma_under_rcu() and use it?

trylock_vma() is not that similar to lock_vma_under_rcu() for that,
IMO. Also, lock_vma_under_rcu() is in the page fault path which is very
hot, so I would not want to add conditions there to make it work for
trylock_vma().

> > +
> > +     /* Check if the vma we locked is the right one. */
>
> Well it might not be the right one :) but might still belong to the right
> mm, so maybe better to refer to the right virtual address space.

Ack. Will change to "Check if the vma belongs to the right address
space."

> > +     if (unlikely(vma->vm_mm != priv->mm))
> > +             goto err;
> > +
> > +     /* vma should not be ahead of the last search position. */
>
> You mean behind the last search position? Surely a VMA being _ahead_ of it
> is fine?

Yes, you are correct. "should not" should have been "should".

> > +     if (unlikely(last_pos >= vma->vm_end))
>
> Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
> problematic? Or is last_pos inclusive?

last_pos is inclusive and vma->vm_end is not, so last_pos == vma->vm_end
would mean the vma is behind last_pos. Since we are searching forward
from last_pos, we should not be finding a vma before last_pos unless it
mutated.

> > +             goto err;
>
> Am I correct in thinking this is what is being checked?
>
>        last_pos
>           |
>           v
>   |---------|
>   |         |
>   |---------|
>            vm_end
>   <--- vma 'next'??? How did we go backwards?

Exactly.

> When last_pos gets updated, is it possible for a shrink to race to cause
> this somehow?

No, we update last_pos only after we have locked the vma and confirmed
it's the right one.

> Do we treat this as an entirely unexpected error condition? In which case
> is a WARN_ON_ONCE() warranted?

No, the VMA might have mutated from under us before we locked it. For
example, it might have been remapped to a higher address.
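
Spelled out as code, the invariant being checked is just this
(hypothetical helper, same predicate as the check quoted above):

/*
 * vm_end is exclusive while last_pos is inclusive, so equality already
 * means every byte of @vma lies below the next address to be reported.
 * A forward search from last_pos must never return such a vma unless
 * the tree mutated under us.
 */
static inline bool vma_is_behind(struct vm_area_struct *vma,
                                 unsigned long last_pos)
{
        return last_pos >= vma->vm_end;
}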
> > +
> > +     /*
> > +      * vma ahead of last search position is possible but we need to
> > +      * verify that it was not shrunk after we found it, and another
> > +      * vma has not been installed ahead of it. Otherwise we might
> > +      * observe a gap that should not be there.
> > +      */
>
> OK so this is the juicy bit.

Yep, that's the case singled out in the changelog.

> > +     if (mm_unstable && last_pos < vma->vm_start) {
> > +             /* Verify only if the address space changed since vma lookup. */
> > +             if ((priv->mm_wr_seq & 1) ||
>
> Can we wrap this into a helper? This is a 'you just have to know that odd
> seq number means a write operation is in effect'. I know you have a comment
> here, but I think something like:
>
>         if (has_mm_been_modified(priv) ||
>
> Would be a lot clearer.

Yeah, I was thinking about that. I think an even cleaner way would be
to remember the return value of mmap_lock_speculate_try_begin() and
pass it around. I was hoping to avoid that extra parameter, but it
sounds like for the sake of clarity that would be preferable?

> Again this speaks to the usefulness of abstracting all this logic from the
> proc code, we are putting super delicate VMA stuff here and it's just not
> the right place.
>
> As an aside, I don't see coverage in the process_addrs documentation on
> sequence number odd/even or speculation?
>
> I think we probably need to cover this to maintain an up-to-date
> description of how the VMA locking mechanism works and is used?

I think that's a very low-level technical detail which I should not
have exposed here. As I mentioned, I should simply store the return
value of mmap_lock_speculate_try_begin() instead of doing these tricky
mm_wr_seq checks.

> > +                 mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
>
> Nit, again unrelated to this series, but would be useful to add a comment
> to mmap_lock_speculate_retry() to indicate that a true return value
> indicates a retry is needed, or renaming it.

This is how the seqcount API works in general. Note that
mmap_lock_speculate_retry() is just a wrapper around
read_seqcount_retry().

> Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
> comment.

See https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/seqlock.h#L395

> Naming is hard :P
>
> Anyway the totality of this expression is 'something changed' or 'read
> section retry required'.

Not quite. The expression is "something changed from under us, or
something was changing even before we started the VMA lookup". In more
technical terms: mmap_write_lock was acquired while we were locking the
VMA, or mmap_write_lock was already held even before we started the VMA
search.

> Under what circumstances would this happen?

See my previous comment; I hope that clarifies it.

> OK so we're into the 'retry' logic here:
>
> > +                     vma_iter_init(&priv->iter, priv->mm, last_pos);
>
> I'd definitely want Liam to confirm this is all above board and correct, as
> these operations are pretty sensitive.
>
> But assuming this is safe, we reset the iterator to the last position...
>
> > +                     if (vma != vma_next(&priv->iter))
>
> Then assert the following VMA is the one we seek.
>
> > +                             goto err;
>
> Might this ever be the case in the course of ordinary operation? Is this
> really an error?

This simply means that the VMA we found before is not at the place we
found it anymore. The locking fails and we should retry.

> > +             }
> > +     }
> > +
> > +     priv->locked_vma = vma;
> > +
> > +     return vma;
> > +err:
>
> As queried above, is this really an error path or something we might expect
> to happen that could simply result in an expected fallback to mmap lock?

It's a failure to lock the VMA, which is handled by retrying under
mmap_read_lock. So a trylock_vma() failure does not mean a fault in the
logic; it's expected to happen occasionally.
> > +     vma_end_read(vma);
> > +     return NULL;
> > +}
> > +
> > +static void unlock_vma(struct proc_maps_private *priv)
> > +{
> > +     if (priv->locked_vma) {
> > +             vma_end_read(priv->locked_vma);
> > +             priv->locked_vma = NULL;
> > +     }
> > +}
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +static inline bool lock_content(struct seq_file *m,
> > +                             struct proc_maps_private *priv)
>
> Pedantic I know but isn't 'lock_content' a bit generic?
>
> He says, not being able to think of a great alternative...
>
> OK maybe fine... :)

Yeah, I struggled with this myself. Help in naming is appreciated.

> > +{
> > +     /*
> > +      * smaps and numa_maps perform page table walk, therefore require
> > +      * mmap_lock but maps can be read with locked vma only.
> > +      */
> > +     if (m->op != &proc_pid_maps_op) {
>
> Nit but is there a neater way of checking this? Actually I imagine not...
>
> But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.
>
>         static inline bool is_maps_op(struct seq_file *m);
>
> And check e.g.
>
>         if (is_maps_op(m)) { ... in the above.
>
> Yeah this is nitty, not a massive deal :)

I'll try that and see how it looks. Thanks!

> > +             if (mmap_read_lock_killable(priv->mm))
> > +                     return false;
> > +
> > +             priv->mmap_locked = true;
> > +     } else {
> > +             rcu_read_lock();
> > +             priv->locked_vma = NULL;
> > +             priv->mmap_locked = false;
> > +     }
> > +
> > +     return true;
> > +}
> > +
> > +static inline void unlock_content(struct proc_maps_private *priv)
> > +{
> > +     if (priv->mmap_locked) {
> > +             mmap_read_unlock(priv->mm);
> > +     } else {
> > +             unlock_vma(priv);
> > +             rcu_read_unlock();
>
> Does this always get called even in error cases?

What error cases do you have in mind? A failure to lock a VMA is
handled by retrying, after which we should be happily proceeding.
Please clarify.

> > +     }
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
>
> We really need a generalised RCU multi-VMA locking mechanism (we're looking
> into madvise VMA locking atm with a conservative single VMA lock at the
> moment, but in future we probably want to be able to span multiple, for
> instance) and this really, really feels like it doesn't belong in this proc
> code.

Ok, I guess you are building a case to move more code into vma.c? I see
what you are doing :)

> >  {
> > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     if (priv->mmap_locked)
> > +             return vma_next(&priv->iter);
> > +
> > +     unlock_vma(priv);
> > +     /*
> > +      * Record sequence number ahead of vma lookup.
> > +      * Odd seqcount means address space modification is in progress.
> > +      */
> > +     mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
>
> Hmm we're discarding the return value, I guess we don't really care about
> that at this stage? Or do we? Do we want to assert the read critical
> section state here?

Yeah, as I mentioned, instead of relying on priv->mm_wr_seq being odd I
should record the return value of mmap_lock_speculate_try_begin(). In
the functional sense the two are interchangeable.

> I guess since we have the mm_wr_seq which we use later it's the same thing
> and doesn't matter.

Yep.
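
Concretely, the change I have in mind is roughly the following (sketch
only; the mm_stable field is hypothetical and the naming is not final):

/* In get_next_vma(): remember whether speculation began cleanly, i.e.
 * whether no writer held mmap_write_lock when we sampled the seqcount. */
priv->mm_stable = mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);

/* In trylock_vma(): replaces the open-coded (priv->mm_wr_seq & 1) test. */
if (!priv->mm_stable ||
    mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
        /* The address space changed (or was already changing):
         * re-validate the lookup before trusting an apparent gap. */
}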
> ~~(off topic a bit)~~
>
> OK so off-topic again, afaict we're doing something pretty horribly gross here.
>
> We pass &priv->mm_wr_seq as the 'unsigned int *seq' field to
> mmap_lock_speculate_try_begin(), which in turn calls:
>
>         return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);
>
> And this is defined as a macro of:
>
> #define raw_seqcount_try_begin(s, start)                                \
> ({                                                                      \
>         start = raw_read_seqcount(s);                                   \
>         !(start & 1);                                                   \
> })
>
> So surely this expands to:
>
>         *seq = raw_read_seqcount(&mm->mm_lock_seq);
>         !(*seq & 1) // return true if even, false if odd
>
> So we're basically ostensibly passing an unsigned int, but because we're
> calling a macro it's actually just 'text' and we're instead able to then
> reassign the underlying unsigned int * ptr and... ugh.
>
> ~~(/off topic a bit)~~

Aaaand we are back...

> > +     vma = vma_next(&priv->iter);
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     vma = trylock_vma(priv, vma, last_pos, true);
> > +     if (vma)
> > +             return vma;
>
> Really feels like this should be a boolean... I guess neat to reset vma if
> not locked though.

I guess I can change trylock_vma() to return a boolean. We always
return the same vma or NULL, I think.

> > +     /* Address space got modified, vma might be stale. Re-lock and retry */
> > +     rcu_read_unlock();
>
> Might we see a VMA possibly actually legit unmapped in a race here? Do we
> need to update last_pos/ppos to account for this? Otherwise we might just
> fail on the last_pos >= vma->vm_end check in trylock_vma() no?

Yes, it can happen, and trylock_vma() will fail to lock the modified
VMA. That's by design. In such cases we retry the lookup from the same
last_pos.

> > +     ret = mmap_read_lock_killable(priv->mm);
>
> Shouldn't we set priv->mmap_locked here?

No, we will drop the mmap_read_lock shortly. priv->mmap_locked
indicates the overall mode we operate in. When priv->mmap_locked=false
we can still temporarily take the mmap_read_lock when retrying and then
drop it after we have found the VMA.

> I guess not, as we are simply holding the mmap lock to definitely get the
> next VMA.

Correct.

> > +     rcu_read_lock();
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > +     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > +     vma = vma_next(&priv->iter);
> > +     if (vma) {
> > +             vma = trylock_vma(priv, vma, last_pos, false);
>
> Be good to use Liam's convention of /* mm_unstable = */ false to make this
> clear.

Yeah, I'm thinking of splitting trylock_vma() into two separate
functions for the mm_unstable=true and mm_unstable=false cases.

> Find it kinda weird again we're 'trylocking' something we already have
> locked via the mmap lock, but I already mentioned this... :)
>
> > +             WARN_ON(!vma); /* mm is stable, has to succeed */
>
> I wonder if this is really useful, at any rate seems like there'd be a
> flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really,
> really ought not happen?

Well, I can't use BUG_ON(), so WARN_ON() is the next tool I have :) In
reality this should never happen, so
WARN_ON/WARN_ON_ONCE/WARN_ON_RATELIMITED/whatever does not matter much.
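
For reference, the RCU-side half of the split I mentioned would look
roughly like this (a shape sketch only, with the function name made up;
the gap re-validation stays in this variant, and the mmap-locked variant
would use vma_start_read_locked() as discussed earlier):

static struct vm_area_struct *lock_vma_rcu(struct proc_maps_private *priv,
                                           struct vm_area_struct *vma,
                                           unsigned long last_pos)
{
        vma = vma_start_read(priv->mm, vma);
        if (IS_ERR_OR_NULL(vma))
                return NULL;                    /* contention/detach: retry */
        if (vma->vm_mm != priv->mm || last_pos >= vma->vm_end) {
                vma_end_read(vma);              /* stale or foreign vma */
                return NULL;
        }
        /* gap re-validation against the mm seqcount would go here */
        priv->locked_vma = vma;
        return vma;
}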
> > +     }
> > +     mmap_read_unlock(priv->mm);
> > +
> > +     return vma;
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> >
> > +static inline bool lock_content(struct seq_file *m,
> > +                             struct proc_maps_private *priv)
> > +{
> > +     return mmap_read_lock_killable(priv->mm) == 0;
> > +}
> > +
> > +static inline void unlock_content(struct proc_maps_private *priv)
> > +{
> > +     mmap_read_unlock(priv->mm);
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
> > +{
> > +     return vma_next(&priv->iter);
> > +}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > +{
> > +     struct proc_maps_private *priv = m->private;
> > +     struct vm_area_struct *vma;
> > +
> > +     vma = get_next_vma(priv, *ppos);
> > +     if (IS_ERR(vma))
> > +             return vma;
> > +
> > +     /* Store previous position to be able to restart if needed */
> > +     priv->last_pos = *ppos;
> >       if (vma) {
> > -             *ppos = vma->vm_start;
> > +             /*
> > +              * Track the end of the reported vma to ensure position changes
> > +              * even if previous vma was merged with the next vma and we
> > +              * found the extended vma with the same vm_start.
> > +              */
>
> Right, so observing repetitions is acceptable in such circumstances? I mean,
> I agree.

Yep, the VMA will be reported twice in such a case.

> > +             *ppos = vma->vm_end;
>
> If we store the end, does the last_pos logic which resets the VMA iterator
> later work correctly in all cases?

I think so. By resetting to vma->vm_end we will start the next search
from the address right after the last reported VMA, no?

> >       } else {
> >               *ppos = -2UL;
> >               vma = get_gate_vma(priv->mm);
>
> Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
> this was the original logic, but maybe put a comment about this as it's
> weird and confusing? (and not your fault obviously :P)

What comment would you like to see here?

> Also, are all locks and state correctly handled in this case? Seems like one
> of these nasty edge case situations that could have jagged edges...

I think we are fine. get_next_vma() returned NULL, so we did not lock
any VMA and priv->locked_vma should be NULL.

> > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> >               return NULL;
> >       }
> >
> > -     if (mmap_read_lock_killable(mm)) {
> > +     if (!lock_content(m, priv)) {
>
> Nice that this just slots in like this! :)
>
> >               mmput(mm);
> >               put_task_struct(priv->task);
> >               priv->task = NULL;
> >               return ERR_PTR(-EINTR);
> >       }
> >
> > +     if (last_addr > 0)
>
> last_addr is an unsigned long, this will always be true.

Not unless last_addr == 0. That's what I'm really checking here: is
this the first invocation of m_start(), in which case we are starting
from the beginning rather than restarting from priv->last_pos. Should I
add a comment?

> You probably want to put an explicit check for -1UL, -2UL here or?
>
> God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
> bit not your fault :P)

No, I don't care here about -1UL, -2UL, just whether last_addr == 0 or
not.
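
For anyone following along, here is my reading of the *ppos protocol
this iterator relies on (a pre-existing seq_file convention here, not
something this patch adds; the helper below is purely illustrative):

enum maps_pos {
        MAPS_POS_START,         /* first m_start() call: walk from 0 */
        MAPS_POS_DONE,          /* previous pass finished */
        MAPS_POS_GATE_VMA,      /* report the gate vma next */
        MAPS_POS_RESUME,        /* restart the walk from priv->last_pos */
};

static enum maps_pos classify_pos(unsigned long last_addr)
{
        if (last_addr == 0)
                return MAPS_POS_START;
        if (last_addr == -1UL)
                return MAPS_POS_DONE;
        if (last_addr == -2UL)
                return MAPS_POS_GATE_VMA;
        return MAPS_POS_RESUME;
}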
> > +             *ppos = last_addr = priv->last_pos;
> >       vma_iter_init(&priv->iter, mm, last_addr);
> >       hold_task_mempolicy(priv);
> >       if (last_addr == -2UL)
> >               return get_gate_vma(mm);
> >
> > -     return proc_get_vma(priv, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> >               *ppos = -1UL;
> >               return NULL;
> >       }
> > -     return proc_get_vma(m->private, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void m_stop(struct seq_file *m, void *v)
> > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
> >               return;
> >
> >       release_task_mempolicy(priv);
> > -     mmap_read_unlock(mm);
> > +     unlock_content(priv);
> >       mmput(mm);
> >       put_task_struct(priv->task);
> >       priv->task = NULL;
> > --
> > 2.49.0.1266.g31b7d2e469-goog
> >
>
> Sorry to add to the workload by digging into so many details here, but we
> really need to make sure all the i's are dotted and t's are crossed given
> how fiddly and fragile this stuff is :)
>
> Very much appreciate the work, this is a significant improvement and will
> have a great deal of real world impact!

Thanks for meticulously going over the code! This is really helpful.
Suren.

> Cheers, Lorenzo