From: Suren Baghdasaryan <surenb@google.com>
Date: Thu, 15 Aug 2024 12:44:37 -0700
Subject: Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
To: Jann Horn
Cc: Christian Brauner, Andrii Nakryiko, Mateusz Guzik, Andrii Nakryiko,
 linux-trace-kernel@vger.kernel.org, peterz@infradead.org, oleg@redhat.com,
 rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org,
 linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org,
 willy@infradead.org, akpm@linux-foundation.org, linux-mm@kvack.org

On Thu, Aug 15, 2024 at 11:58 AM Jann Horn wrote:
>
> +brauner for "struct file" lifetime
>
> On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan wrote:
> > On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
> > wrote:
> > >
> > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik wrote:
> > > >
> > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik wrote:
> > > > > >
> > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > > > attempting uprobe look up speculatively.
>
> Stupid question: Is this uprobe stuff actually such a hot codepath
> that it makes sense to optimize it to be faster than the page fault
> path?
>
> (Sidenote: I find it kinda interesting that this is sort of going back
> in the direction of the old Speculative Page Faults design.)
>
> > > > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > > > validate that mm_struct stays intact for entire duration of this
> > > > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > > > >
> > > > > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > > > > cases, nicely improving uprobe/uretprobe scalability.
> > > > > > >
> > > > > >
> > > > > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > > > > me. :>
> > > > > >
> > > > > > I note the result of find_active_uprobe_speculative is immediately stale
> > > > > > in face of modifications.
> > > > > >
> > > > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > > > overhead on archs where a release fence is not a de facto nop and I
> > > > > > don't believe the commit message justifies it. Definitely a bummer to
> > > > > > add merely it for uprobes. If there are bigger plans concerning it
> > > > > > that's a different story of course.
> > > > > >
> > > > > > With this in mind I have to ask if instead you could perhaps get away
> > > > > > with the already present per-vma sequence counter?
> > > > > >
> > > > > per-vma sequence counter does not implement acquire/release logic, it
> > > > > relies on vma->vm_lock for synchronization. So if we want to use it,
> > > > > we would have to add additional memory barriers here. This is likely
> > > > > possible but as I mentioned before we would need to ensure the
> > > > > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> > > > > there (it implements acquire/release logic), we just had to ensure
> > > > > mmap_write_lock() increments mm->mm_lock_seq.
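
A minimal sketch of what mm_lock_seq-based speculation helpers along these
lines could look like -- an illustration pieced together from the
description above, not the actual patch; the odd/even convention and the
exact barriers are assumptions:

/*
 * Illustration only: assumes mmap_write_lock() makes mm->mm_lock_seq odd
 * and mmap_write_unlock() makes it even again, like a plain seqcount.
 */
static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
{
	/* Acquire orders all speculative reads after this load. */
	*seq = smp_load_acquire(&mm->mm_lock_seq);
	/* An odd count means mmap_lock is write-locked: don't speculate. */
	return (*seq & 1) == 0;
}

static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
{
	/* Order the speculative reads before re-reading the count. */
	smp_rmb();
	/* Any writer in between changed the count, invalidating the speculation. */
	return READ_ONCE(mm->mm_lock_seq) == seq;
}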
> > > > >
> > > > > So, from the release fence overhead POV I think whether we use
> > > > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > > > here.
> > > >
> > > > Per my previous e-mail I'm not particularly familiar with mm internals,
> > > > so I'm going to handwave a little bit with my $0,03 concerning multicore
> > > > in general and if you disagree with it that's your business. For the
> > > > time being I have no interest in digging into any of this.
> > > >
> > > > Before I do, to prevent this thread from being a total waste, here are
> > > > some remarks concerning the patch with the assumption that the core idea
> > > > lands.
> > > >
> > > > From the commit message:
> > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > attempting uprobe look up speculatively.
> > > >
> > > > Just in case I'll note a nit that this paragraph will need to be removed
> > > > since the patch adding the flag is getting dropped.

> > > Yep, of course, I'll update all that for the next revision (I'll wait
> > > for non-RFC patches to land first before reposting).
> > >
> > > >
> > > > A non-nit which may or may not end up mattering is that the flag (which
> > > > *is* set on the filep slab cache) makes things more difficult to
> > > > validate. Normal RCU usage guarantees that the object itself wont be
> > > > freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> > > > flag weakens it significantly -- the thing at hand will always be a
> > > > 'struct file', but it may get reallocated to *another* file from under
> > > > you. Whether this aspect plays a role here I don't know.

> > > Yes, that's ok and is accounted for. We care about that memory not
> > > going even from under us (I'm not even sure if it matters that it is
> > > still a struct file, tbh; I think that shouldn't matter as we are
> > > prepared to deal with completely garbage values read from struct
> > > file).

> > Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that
> > vma->vm_file has not been freed and reused. That's where
> > mmap_lock_speculation_{start|end} helps us. For vma->vm_file to change
> > from under us one would have to take mmap_lock for write. If that
> > happens mmap_lock_speculation_{start|end} should detect that and
> > terminate our speculation.
> >
> > >
> > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > > +{
> > > > > +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > > +	struct mm_struct *mm = current->mm;
> > > > > +	struct uprobe *uprobe;
> > > > > +	struct vm_area_struct *vma;
> > > > > +	struct file *vm_file;
> > > > > +	struct inode *vm_inode;
> > > > > +	unsigned long vm_pgoff, vm_start;
> > > > > +	int seq;
> > > > > +	loff_t offset;
> > > > > +
> > > > > +	if (!mmap_lock_speculation_start(mm, &seq))
> > > > > +		return NULL;
> > > > > +
> > > > > +	rcu_read_lock();
> > > > > +

> > > > I don't think there is a correctness problem here, but entering rcu
> > > > *after* deciding to speculatively do the lookup feels backwards.
> > >
> > > RCU should protect VMA and file, mm itself won't go anywhere, so this seems ok.
> > > > >
> > > > > +	vma = vma_lookup(mm, bp_vaddr);
> > > > > +	if (!vma)
> > > > > +		goto bail;
> > > > > +
> > > > > +	vm_file = data_race(vma->vm_file);
> > > > > +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > > +		goto bail;
> > > > > +

> > > > If vma teardown is allowed to progress and the file got fput'ed...

> > > > > +	vm_inode = data_race(vm_file->f_inode);

> > > > ... the inode can be NULL, I don't know if that's handled.

> > > Yep, inode pointer value is part of RB-tree key, so if it's NULL, we
> > > just won't find a matching uprobe. Same for any other "garbage"
> > > f_inode value. Importantly, we never should dereference such inode
> > > pointers, at least until we find a valid uprobe (in which case we keep
> > > inode reference to it).

> > > > More importantly though, per my previous description of
> > > > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > > > the inode you did find is completely unrelated.
> > > >
> > > > I understand the intent is to backpedal from everything should the mm
> > > > seqc change, but the above may happen to matter.

> > > Yes, I think we took that into account. All that we care about is
> > > memory "type safety", i.e., even if struct file's memory is reused,
> > > it's ok, we'll eventually detect the change and will discard wrong
> > > uprobe that we might by accident lookup (though probably in most cases
> > > we just won't find a uprobe at all).
> > >
> > > >
> > > > > +	vm_pgoff = data_race(vma->vm_pgoff);
> > > > > +	vm_start = data_race(vma->vm_start);
> > > > > +
> > > > > +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > > +	uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > > +	if (!uprobe)
> > > > > +		goto bail;
> > > > > +
> > > > > +	/* now double check that nothing about MM changed */
> > > > > +	if (!mmap_lock_speculation_end(mm, seq))
> > > > > +		goto bail;

> > > > This leaks the reference obtained by find_uprobe_rcu().

> > > find_uprobe_rcu() doesn't obtain a reference, uprobe is RCU-protected,
> > > and if caller need a refcount bump it will have to use
> > > try_get_uprobe() (which might fail).
> > >
> > > >
> > > > > +
> > > > > +	rcu_read_unlock();
> > > > > +
> > > > > +	/* happy case, we speculated successfully */
> > > > > +	return uprobe;
> > > > > +bail:
> > > > > +	rcu_read_unlock();
> > > > > +	return NULL;
> > > > > +}

> > > > Now to some handwaving, here it is:

> > > > The core of my concern is that adding more work to down_write on the
> > > > mmap semaphore comes with certain side-effects and plausibly more than a
> > > > sufficient speed up can be achieved without doing it.

> > AFAIK writers of mmap_lock are not considered a fast path. In a sense
> > yes, we made any writer a bit heavier but OTOH we also made
> > mm->mm_lock_seq a proper sequence count which allows us to locklessly
> > check if mmap_lock is write-locked. I think you asked whether there
> > will be other uses for mmap_lock_speculation_{start|end} and yes. For
> > example, I am planning to use them for printing /proc/{pid}/maps
> > without taking mmap_lock (when it's uncontended).
>
> What would be the goal of this - to avoid cacheline bouncing of the
> mmap lock between readers? Or to allow mmap_write_lock() to preempt
> /proc/{pid}/maps readers who started out uncontended?

The latter, from my early patchset which I need to refine
(https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/):

This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such
as monitoring/data collection services) from blocking address space
updates.

>
> Is the idea that you'd change show_map_vma() to first do something
> like get_file_active() to increment the file refcount (because
> otherwise the dentry can be freed under you and you need the dentry
> for path printing), then recheck your sequence count on the mm or vma
> (to avoid accessing the dentry of an unrelated file that hasn't become
> userspace-visible yet and may not have a proper dentry pointer yet),
> then print the file path, drop the file reference again, and in the
> end recheck the sequence count again before actually returning the
> printed data to userspace?

Yeah, you can see the details in that link I posted above. See
get_vma_snapshot() function.

>
> > If we have VMA seq
> > counter-based detection it would be better (see below).
> >
> > > >
> > > > An mm-wide mechanism is just incredibly coarse-grained and it may happen
> > > > to perform poorly when faced with a program which likes to mess with its
> > > > address space -- the fast path is going to keep failing and only
> > > > inducing *more* overhead as the code decides to down_read the mmap
> > > > semaphore.
> > > >
> > > > Furthermore there may be work currently synchronized with down_write
> > > > which perhaps can transition to "merely" down_read, but by the time it
> > > > happens this and possibly other consumers expect a change in the
> > > > sequence counter, messing with it.
> > > >
> > > > To my understanding the kernel supports parallel faults with per-vma
> > > > locking. I would find it surprising if the same machinery could not be
> > > > used to sort out uprobe handling above.

> > From all the above, my understanding of your objection is that
> > checking mmap_lock during our speculation is too coarse-grained and
> > you would prefer to use the VMA seq counter to check that the VMA we
> > are working on is unchanged. I agree, that would be ideal. I had a
> > quick chat with Jann about this and the conclusion we came to is that
> > we would need to add an additional smp_wmb() barrier inside
> > vma_start_write() and a smp_rmb() in the speculation code:
> >
> > static inline void vma_start_write(struct vm_area_struct *vma)
> > {
> > 	int mm_lock_seq;
> >
> > 	if (__is_vma_write_locked(vma, &mm_lock_seq))
> > 		return;
> >
> > 	down_write(&vma->vm_lock->lock);
> > 	/*
> > 	 * We should use WRITE_ONCE() here because we can have concurrent reads
> > 	 * from the early lockless pessimistic check in vma_start_read().
> > 	 * We don't really care about the correctness of that early check, but
> > 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> > 	 */
> > 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > +	smp_wmb();
> > 	up_write(&vma->vm_lock->lock);
> > }
> >
> > Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
> > enough because it's one-way permeable (it's a "RELEASE operation") and
> > later vma->vm_file store (or any other VMA modification) can move
> > before our vma->vm_lock_seq store.
> >
> > This makes vma_start_write() heavier but again, it's write-locking, so
> > should not be considered a fast path.

> > With this change we can use the code suggested by Andrii in
> > https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
> > with an additional smp_rmb():
> >
> > rcu_read_lock()
> > vma = find_vma(...)
> > if (!vma) /* bail */
>
> And maybe add some comments like:
>
> /*
>  * Load the current VMA lock sequence - we will detect if anyone concurrently
>  * locks the VMA after this point.
>  * Pairs with smp_wmb() in vma_start_write().
>  */

> > vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);

> /*
>  * Now we just have to detect if the VMA is already locked with its current
>  * sequence count.
>  *
>  * The following load is ordered against the vm_lock_seq load above (using
>  * smp_load_acquire() for the load above), and pairs with implicit memory
>  * ordering between the mm_lock_seq write in mmap_write_unlock() and the
>  * vm_lock_seq write in the next vma_start_write() after that (which can only
>  * occur after an mmap_write_lock()).
>  */

> > mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> > /* I think vm_lock has to be acquired first to avoid the race */
> > if (mm_lock_seq == vm_lock_seq)
> > 	/* bail, vma is write-locked */
> > ... perform uprobe lookup logic based on vma->vm_file->f_inode ...

> /*
>  * Order the speculative accesses above against the following vm_lock_seq
>  * recheck.
>  */

> > smp_rmb();
> > if (vma->vm_lock_seq != vm_lock_seq)
>
> (As I said on the other thread: Since this now relies on
> vma->vm_lock_seq not wrapping back to the same value for correctness,
> I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
> even better, an atomic64_t... though I realize we don't currently do
> that for seqlocks either.)

> > 	/* bail, VMA might have changed */

> > The smp_rmb() is needed so that vma->vm_lock_seq load does not get
> > reordered and moved up before speculation.

> > I'm CC'ing Jann since he understands memory barriers way better than
> > me and will keep me honest.
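
For reference, putting the pieces above together, a rough sketch of the
per-VMA-sequence-count variant of the speculative lookup could look like
the following. Illustration only: the function name is made up, it assumes
the extra smp_wmb() in vma_start_write() proposed above, and it borrows
find_uprobe_rcu() and the flags check from the patch under discussion.

/* Illustration assembled from the snippets in this thread, not the actual patch. */
static struct uprobe *find_active_uprobe_vma_seq(unsigned long bp_vaddr)
{
	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	struct uprobe *uprobe = NULL;
	struct file *vm_file;
	struct inode *vm_inode;
	unsigned long vm_pgoff, vm_start;
	int vm_lock_seq, mm_lock_seq;
	loff_t offset;

	rcu_read_lock();

	vma = vma_lookup(mm, bp_vaddr);
	if (!vma)
		goto bail;

	/* Pairs with the smp_wmb() proposed for vma_start_write() above. */
	vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
	mm_lock_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq);
	/* Equal counts mean this VMA is currently write-locked: bail out. */
	if (vm_lock_seq == mm_lock_seq)
		goto bail;

	vm_file = data_race(vma->vm_file);
	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
		goto bail;

	vm_inode = data_race(vm_file->f_inode);
	vm_pgoff = data_race(vma->vm_pgoff);
	vm_start = data_race(vma->vm_start);

	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
	uprobe = find_uprobe_rcu(vm_inode, offset);
	if (!uprobe)
		goto bail;

	/* Order the speculative reads above against the vm_lock_seq recheck. */
	smp_rmb();
	if (READ_ONCE(vma->vm_lock_seq) != vm_lock_seq)
		uprobe = NULL;	/* VMA may have changed, fall back to the locked path */

bail:
	rcu_read_unlock();
	return uprobe;
}

Compared with the mm-wide mmap_lock_speculation_{start|end} scheme, this
would only invalidate the speculation when this particular VMA gets
write-locked, at the cost of the extra barrier in vma_start_write().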