* [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup
@ 2024-09-06 5:12 Andrii Nakryiko
2024-09-06 5:12 ` [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-06 5:12 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, mjguzik, brauner, jannh, Andrii Nakryiko
Implement speculative (lockless) resolution of VMA to inode to uprobe,
bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
adds mm_struct helpers that detect whether the mm_struct was changed, which
the uprobe logic uses to validate that speculative results can be trusted
once the lookup produces a valid uprobe instance.
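In a nutshell, the lookup flow added in patch #2 looks like this (condensed
sketch, with error handling and the actual VMA/file field reads trimmed):

	if (!mmap_lock_speculation_start(mm, &seq))
		return NULL; /* mmap_lock is write-locked, use the slow path */

	rcu_read_lock();
	/* ... lockless vma -> vm_file -> f_inode resolution ... */
	uprobe = find_uprobe_rcu(vm_inode, offset);
	/* trust the result only if mm wasn't modified in the meantime */
	if (!mmap_lock_speculation_end(mm, seq))
		uprobe = NULL;
	rcu_read_unlock();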
I ran a few will-it-scale benchmarks to sanity check that patch #1 doesn't
introduce any noticeable regressions, and it doesn't seem to.
Andrii Nakryiko (1):
uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
Suren Baghdasaryan (1):
mm: introduce mmap_lock_speculation_{start|end}
include/linux/mm_types.h | 3 +++
include/linux/mmap_lock.h | 53 +++++++++++++++++++++++++++++++--------
kernel/events/uprobes.c | 51 +++++++++++++++++++++++++++++++++++++
kernel/fork.c | 3 ---
4 files changed, 97 insertions(+), 13 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end}
2024-09-06 5:12 [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
@ 2024-09-06 5:12 ` Andrii Nakryiko
2024-09-09 12:35 ` Jann Horn
2024-09-06 5:12 ` [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
2024-09-10 16:06 ` [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Jann Horn
2 siblings, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-06 5:12 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, mjguzik, brauner, jannh, Andrii Nakryiko
From: Suren Baghdasaryan <surenb@google.com>
Add helper functions to speculatively perform operations without
read-locking mmap_lock, expecting that mmap_lock will not be
write-locked and the mm will not be modified from under us.
A few will-it-scale ([0]) benchmarks were run with and without the
changes in this patch on a beefy server, using the test script below.
Most of the background activity on the server was stopped, but there could
still be some sporadic sources of noise. And frankly, will-it-scale
benchmarks themselves aren't 100% noise-free and do fluctuate somewhat.
The malloc1 benchmark was also getting stuck for some reason, so it was
skipped. Still, the runs were useful as a bit of a sanity check; take
all of the numbers with a grain of salt.
Benchmark script:
# cat will_it_scale.sh
#!/bin/bash
set -eufo pipefail
for b in page_fault1 page_fault2 page_fault3 malloc2; do
for p in 40; do
echo BENCH $b CPU$p $(will-it-scale/${b}_threads -m -t$p -s60 | grep average)
done;
done;
Before (w/o this patch)
=======================
BENCH page_fault1 CPU40 average:5403940
BENCH page_fault2 CPU40 average:5019159
BENCH page_fault3 CPU40 average:971057
BENCH malloc2 CPU40 average:1364730680
After (w/ this patch)
=====================
BENCH page_fault1 CPU40 average:5485892
BENCH page_fault2 CPU40 average:5047923
BENCH page_fault3 CPU40 average:982953
BENCH malloc2 CPU40 average:1361339890
Results seem to be within the noise of measurements, but perhaps mm
folks might have better benchmarks to try.
[0] https://github.com/antonblanchard/will-it-scale
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/linux/mm_types.h | 3 +++
include/linux/mmap_lock.h | 53 +++++++++++++++++++++++++++++++--------
kernel/fork.c | 3 ---
3 files changed, 46 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..d5e3f907eea4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -876,6 +876,9 @@ struct mm_struct {
* Roughly speaking, incrementing the sequence number is
* equivalent to releasing locks on VMAs; reading the sequence
* number can be part of taking a read lock on a VMA.
+ * Incremented every time mmap_lock is write-locked/unlocked.
+ * Initialized to 0, therefore odd values indicate mmap_lock
+ * is write-locked and even values that it's released.
*
* Can be modified under write mmap_lock using RELEASE
* semantics.
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index de9dc20b01ba..5410ce741d75 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -71,15 +71,12 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
}
#ifdef CONFIG_PER_VMA_LOCK
-/*
- * Drop all currently-held per-VMA locks.
- * This is called from the mmap_lock implementation directly before releasing
- * a write-locked mmap_lock (or downgrading it to read-locked).
- * This should normally NOT be called manually from other places.
- * If you want to call this manually anyway, keep in mind that this will release
- * *all* VMA write locks, including ones from further up the stack.
- */
-static inline void vma_end_write_all(struct mm_struct *mm)
+static inline void init_mm_lock_seq(struct mm_struct *mm)
+{
+ mm->mm_lock_seq = 0;
+}
+
+static inline void inc_mm_lock_seq(struct mm_struct *mm)
{
mmap_assert_write_locked(mm);
/*
@@ -91,19 +88,52 @@ static inline void vma_end_write_all(struct mm_struct *mm)
*/
smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
+
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
+{
+ /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
+ *seq = smp_load_acquire(&mm->mm_lock_seq);
+ /* Allow speculation if mmap_lock is not write-locked */
+ return (*seq & 1) == 0;
+}
+
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
+{
+ /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
+ return seq == smp_load_acquire(&mm->mm_lock_seq);
+}
+
#else
-static inline void vma_end_write_all(struct mm_struct *mm) {}
+static inline void init_mm_lock_seq(struct mm_struct *mm) {}
+static inline void inc_mm_lock_seq(struct mm_struct *mm) {}
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
#endif
+/*
+ * Drop all currently-held per-VMA locks.
+ * This is called from the mmap_lock implementation directly before releasing
+ * a write-locked mmap_lock (or downgrading it to read-locked).
+ * This should normally NOT be called manually from other places.
+ * If you want to call this manually anyway, keep in mind that this will release
+ * *all* VMA write locks, including ones from further up the stack.
+ */
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+ inc_mm_lock_seq(mm);
+}
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
+ init_mm_lock_seq(mm);
}
static inline void mmap_write_lock(struct mm_struct *mm)
{
__mmap_lock_trace_start_locking(mm, true);
down_write(&mm->mmap_lock);
+ inc_mm_lock_seq(mm);
__mmap_lock_trace_acquire_returned(mm, true, true);
}
@@ -111,6 +141,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
{
__mmap_lock_trace_start_locking(mm, true);
down_write_nested(&mm->mmap_lock, subclass);
+ inc_mm_lock_seq(mm);
__mmap_lock_trace_acquire_returned(mm, true, true);
}
@@ -120,6 +151,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
__mmap_lock_trace_start_locking(mm, true);
ret = down_write_killable(&mm->mmap_lock);
+ if (!ret)
+ inc_mm_lock_seq(mm);
__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
return ret;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 18bdc87209d0..c44b71d354ee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
seqcount_init(&mm->write_protect_seq);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
-#ifdef CONFIG_PER_VMA_LOCK
- mm->mm_lock_seq = 0;
-#endif
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
mm->locked_vm = 0;
--
2.43.5
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-06 5:12 [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
2024-09-06 5:12 ` [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-09-06 5:12 ` Andrii Nakryiko
2024-09-08 1:22 ` Liam R. Howlett
` (2 more replies)
2024-09-10 16:06 ` [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Jann Horn
2 siblings, 3 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-06 5:12 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, mjguzik, brauner, jannh, Andrii Nakryiko
Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
protection, which enables looking up uprobe from uprobes_tree completely
locklessly and speculatively without the need to acquire mmap_lock for
reads. This works in most cases, under the assumption that there are no
parallel mm and/or VMA modifications. The underlying struct file's
memory won't go away from under us (even if the struct file can be reused
in the meantime).
We rely on the newly added mmap_lock_speculation_{start,end}() helpers to
validate that the mm_struct stays intact for the entire duration of this
speculation. If not, we fall back to the mmap_lock-protected lookup.
The speculative logic is written in such a way that it will safely
handle any garbage values that might be read from vma or file structs.
Benchmarking results speak for themselves.
BEFORE (latest tip/perf/core)
=============================
uprobe-nop ( 1 cpus): 3.384 ± 0.004M/s ( 3.384M/s/cpu)
uprobe-nop ( 2 cpus): 5.456 ± 0.005M/s ( 2.728M/s/cpu)
uprobe-nop ( 3 cpus): 7.863 ± 0.015M/s ( 2.621M/s/cpu)
uprobe-nop ( 4 cpus): 9.442 ± 0.008M/s ( 2.360M/s/cpu)
uprobe-nop ( 5 cpus): 11.036 ± 0.013M/s ( 2.207M/s/cpu)
uprobe-nop ( 6 cpus): 10.884 ± 0.019M/s ( 1.814M/s/cpu)
uprobe-nop ( 7 cpus): 7.897 ± 0.145M/s ( 1.128M/s/cpu)
uprobe-nop ( 8 cpus): 10.021 ± 0.128M/s ( 1.253M/s/cpu)
uprobe-nop (10 cpus): 9.932 ± 0.170M/s ( 0.993M/s/cpu)
uprobe-nop (12 cpus): 8.369 ± 0.056M/s ( 0.697M/s/cpu)
uprobe-nop (14 cpus): 8.678 ± 0.017M/s ( 0.620M/s/cpu)
uprobe-nop (16 cpus): 7.392 ± 0.003M/s ( 0.462M/s/cpu)
uprobe-nop (24 cpus): 5.326 ± 0.178M/s ( 0.222M/s/cpu)
uprobe-nop (32 cpus): 5.426 ± 0.059M/s ( 0.170M/s/cpu)
uprobe-nop (40 cpus): 5.262 ± 0.070M/s ( 0.132M/s/cpu)
uprobe-nop (48 cpus): 6.121 ± 0.010M/s ( 0.128M/s/cpu)
uprobe-nop (56 cpus): 6.252 ± 0.035M/s ( 0.112M/s/cpu)
uprobe-nop (64 cpus): 7.644 ± 0.023M/s ( 0.119M/s/cpu)
uprobe-nop (72 cpus): 7.781 ± 0.001M/s ( 0.108M/s/cpu)
uprobe-nop (80 cpus): 8.992 ± 0.048M/s ( 0.112M/s/cpu)
AFTER
=====
uprobe-nop ( 1 cpus): 3.534 ± 0.033M/s ( 3.534M/s/cpu)
uprobe-nop ( 2 cpus): 6.701 ± 0.007M/s ( 3.351M/s/cpu)
uprobe-nop ( 3 cpus): 10.031 ± 0.007M/s ( 3.344M/s/cpu)
uprobe-nop ( 4 cpus): 13.003 ± 0.012M/s ( 3.251M/s/cpu)
uprobe-nop ( 5 cpus): 16.274 ± 0.006M/s ( 3.255M/s/cpu)
uprobe-nop ( 6 cpus): 19.563 ± 0.024M/s ( 3.261M/s/cpu)
uprobe-nop ( 7 cpus): 22.696 ± 0.054M/s ( 3.242M/s/cpu)
uprobe-nop ( 8 cpus): 24.534 ± 0.010M/s ( 3.067M/s/cpu)
uprobe-nop (10 cpus): 30.475 ± 0.117M/s ( 3.047M/s/cpu)
uprobe-nop (12 cpus): 33.371 ± 0.017M/s ( 2.781M/s/cpu)
uprobe-nop (14 cpus): 38.864 ± 0.004M/s ( 2.776M/s/cpu)
uprobe-nop (16 cpus): 41.476 ± 0.020M/s ( 2.592M/s/cpu)
uprobe-nop (24 cpus): 64.696 ± 0.021M/s ( 2.696M/s/cpu)
uprobe-nop (32 cpus): 85.054 ± 0.027M/s ( 2.658M/s/cpu)
uprobe-nop (40 cpus): 101.979 ± 0.032M/s ( 2.549M/s/cpu)
uprobe-nop (48 cpus): 110.518 ± 0.056M/s ( 2.302M/s/cpu)
uprobe-nop (56 cpus): 117.737 ± 0.020M/s ( 2.102M/s/cpu)
uprobe-nop (64 cpus): 124.613 ± 0.079M/s ( 1.947M/s/cpu)
uprobe-nop (72 cpus): 133.239 ± 0.032M/s ( 1.851M/s/cpu)
uprobe-nop (80 cpus): 142.037 ± 0.138M/s ( 1.775M/s/cpu)
Previously, total throughput was maxing out at 11 mln/s and gradually
declining past 8 cores. With this change, it keeps growing with each
added CPU, reaching 142 mln/s at 80 CPUs (measured on an 80-core
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
kernel/events/uprobes.c | 51 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a2e6a57f79f2..b7e0baa83de1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2081,6 +2081,53 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
return is_trap_insn(&opcode);
}
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+ const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+ struct mm_struct *mm = current->mm;
+ struct uprobe *uprobe;
+ struct vm_area_struct *vma;
+ struct file *vm_file;
+ struct inode *vm_inode;
+ unsigned long vm_pgoff, vm_start;
+ int seq;
+ loff_t offset;
+
+ if (!mmap_lock_speculation_start(mm, &seq))
+ return NULL;
+
+ rcu_read_lock();
+
+ vma = vma_lookup(mm, bp_vaddr);
+ if (!vma)
+ goto bail;
+
+ vm_file = data_race(vma->vm_file);
+ if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
+ goto bail;
+
+ vm_inode = data_race(vm_file->f_inode);
+ vm_pgoff = data_race(vma->vm_pgoff);
+ vm_start = data_race(vma->vm_start);
+
+ offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+ uprobe = find_uprobe_rcu(vm_inode, offset);
+ if (!uprobe)
+ goto bail;
+
+ /* now double check that nothing about MM changed */
+ if (!mmap_lock_speculation_end(mm, seq))
+ goto bail;
+
+ rcu_read_unlock();
+
+ /* happy case, we speculated successfully */
+ return uprobe;
+bail:
+ rcu_read_unlock();
+ return NULL;
+}
+
/* assumes being inside RCU protected region */
static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
{
@@ -2088,6 +2135,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
struct uprobe *uprobe = NULL;
struct vm_area_struct *vma;
+ uprobe = find_active_uprobe_speculative(bp_vaddr);
+ if (uprobe)
+ return uprobe;
+
mmap_read_lock(mm);
vma = vma_lookup(mm, bp_vaddr);
if (vma) {
--
2.43.5
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-06 5:12 ` [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-09-08 1:22 ` Liam R. Howlett
2024-09-09 1:08 ` Andrii Nakryiko
2024-09-09 13:12 ` Jann Horn
2024-09-15 15:04 ` Oleg Nesterov
2 siblings, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2024-09-08 1:22 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
mjguzik, brauner, jannh
* Andrii Nakryiko <andrii@kernel.org> [240906 01:12]:
...
> ---
> kernel/events/uprobes.c | 51 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 51 insertions(+)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index a2e6a57f79f2..b7e0baa83de1 100644
...
> @@ -2088,6 +2135,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
I'm having issues locating this function in akpm/mm-unstable. What
tree/commits am I missing to do a full review of this code?
> struct uprobe *uprobe = NULL;
> struct vm_area_struct *vma;
>
> + uprobe = find_active_uprobe_speculative(bp_vaddr);
> + if (uprobe)
> + return uprobe;
> +
> mmap_read_lock(mm);
> vma = vma_lookup(mm, bp_vaddr);
> if (vma) {
> --
> 2.43.5
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-08 1:22 ` Liam R. Howlett
@ 2024-09-09 1:08 ` Andrii Nakryiko
0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 1:08 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, linux-trace-kernel, peterz,
oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck,
willy, surenb, akpm, linux-mm, mjguzik, brauner, jannh
On Sat, Sep 7, 2024 at 6:22 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii@kernel.org> [240906 01:12]:
>
> ...
>
> > ---
> > kernel/events/uprobes.c | 51 +++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 51 insertions(+)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index a2e6a57f79f2..b7e0baa83de1 100644
> ...
>
> > @@ -2088,6 +2135,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
>
> I'm having issues locating this function in akpm/mm-unstable. What
> tree/commits am I missing to do a full review of this code?
Hey Liam,
These patches are based on tip/perf/core, find_active_uprobe_rcu()
just landed a few days ago. See [0].
[0] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=perf/core
>
> > struct uprobe *uprobe = NULL;
> > struct vm_area_struct *vma;
> >
> > + uprobe = find_active_uprobe_speculative(bp_vaddr);
> > + if (uprobe)
> > + return uprobe;
> > +
> > mmap_read_lock(mm);
> > vma = vma_lookup(mm, bp_vaddr);
> > if (vma) {
> > --
> > 2.43.5
> >
> >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end}
2024-09-06 5:12 ` [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-09-09 12:35 ` Jann Horn
2024-09-10 2:09 ` Suren Baghdasaryan
0 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2024-09-09 12:35 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
mjguzik, brauner
On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> + return seq == smp_load_acquire(&mm->mm_lock_seq);
> +}
A load-acquire can't provide "end of locked section" semantics - a
load-acquire is a one-way barrier, you can basically use it for
"acquire lock" semantics but not for "release lock" semantics, because
the CPU will prevent reordering the load with *later* loads but not
with *earlier* loads. So if you do:
mmap_lock_speculation_start()
[locked reads go here]
mmap_lock_speculation_end()
then the CPU is allowed to reorder your instructions like this:
mmap_lock_speculation_start()
mmap_lock_speculation_end()
[locked reads go here]
so the lock is broken.
> static inline void mmap_write_lock(struct mm_struct *mm)
> {
> __mmap_lock_trace_start_locking(mm, true);
> down_write(&mm->mmap_lock);
> + inc_mm_lock_seq(mm);
> __mmap_lock_trace_acquire_returned(mm, true, true);
> }
Similarly, inc_mm_lock_seq(), which does a store-release, can only
provide "release lock" semantics, not "take lock" semantics, because
the CPU can reorder it with later stores; for example, this code:
inc_mm_lock_seq()
[locked stuff goes here]
inc_mm_lock_seq()
can be reordered into this:
[locked stuff goes here]
inc_mm_lock_seq()
inc_mm_lock_seq()
so the lock is broken.
For "taking a lock" with a memory store, or "dropping a lock" with a
memory load, you need heavier memory barriers, see
Documentation/memory-barriers.txt.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-06 5:12 ` [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
2024-09-08 1:22 ` Liam R. Howlett
@ 2024-09-09 13:12 ` Jann Horn
2024-09-09 21:29 ` Andrii Nakryiko
2024-09-15 15:04 ` Oleg Nesterov
2 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2024-09-09 13:12 UTC (permalink / raw)
To: Andrii Nakryiko, brauner
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
mjguzik, Miklos Szeredi, Amir Goldstein, linux-fsdevel
On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
> access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
No, not every file is SLAB_TYPESAFE_BY_RCU - see for example
ovl_mmap(), which uses backing_file_mmap(), which does
vma_set_file(vma, file) where "file" comes from ovl_mmap()'s
"realfile", which comes from file->private_data, which is set in
ovl_open() to the return value of ovl_open_realfile(), which comes
from backing_file_open(), which allocates a file with
alloc_empty_backing_file(), which uses a normal kzalloc() without any
RCU stuff, with this comment:
* This is only for kernel internal use, and the allocate file must not be
* installed into file tables or such.
And when a backing_file is freed, you can see on the path
__fput() -> file_free()
that files with FMODE_BACKING are directly freed with kfree(), no RCU delay.
So the RCU-ness of "struct file" is an implementation detail of the
VFS, and you can't rely on it for ->vm_file unless you get the VFS to
change how backing file lifetimes work, which might slow down some
other workload, or you find a way to figure out whether you're dealing
with a backing file without actually accessing the file.
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> + struct mm_struct *mm = current->mm;
> + struct uprobe *uprobe;
> + struct vm_area_struct *vma;
> + struct file *vm_file;
> + struct inode *vm_inode;
> + unsigned long vm_pgoff, vm_start;
> + int seq;
> + loff_t offset;
> +
> + if (!mmap_lock_speculation_start(mm, &seq))
> + return NULL;
> +
> + rcu_read_lock();
> +
> + vma = vma_lookup(mm, bp_vaddr);
> + if (!vma)
> + goto bail;
> +
> + vm_file = data_race(vma->vm_file);
A plain "data_race()" says "I'm fine with this load tearing", but
you're relying on this load not tearing (since you access the vm_file
pointer below).
You're also relying on the "struct file" that vma->vm_file points to
being populated at this point, which means you need CONSUME semantics
here, which READ_ONCE() will give you, and something like RELEASE
semantics on any pairing store that populates vma->vm_file, which
means they'd all have to become something like smp_store_release().
You might want to instead add another recheck of the sequence count
(which would involve at least a read memory barrier after the
preceding patch is fixed) after loading the ->vm_file pointer to
ensure that no one was concurrently changing the ->vm_file pointer
before you do memory accesses through it.
> + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> + goto bail;
missing data_race() annotation on the vma->vm_flags access
> + vm_inode = data_race(vm_file->f_inode);
As noted above, this doesn't work because you can't rely on having RCU
lifetime for the file. One *very* ugly hack you could do, if you think
this code is so performance-sensitive that you're willing to do fairly
atrocious things here, would be to do a "yes I am intentionally doing
a UAF read and I know the address might not even be mapped at this
point, it's fine, trust me" pattern, where you use
copy_from_kernel_nofault(), kind of like in prepend_copy() in
fs/d_path.c, and then immediately recheck the sequence count before
doing *anything* with this vm_inode pointer you just loaded.
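Roughly something like this (a sketch only, reusing the names from the
patch above; the loaded value must still be thrown away if the sequence
check fails):

	/*
	 * Deliberately fault-tolerant read: vm_file may already have been
	 * freed (and its memory even unmapped), so don't dereference it
	 * directly, and only trust the value after re-checking the
	 * sequence count.
	 */
	if (copy_from_kernel_nofault(&vm_inode, &vm_file->f_inode,
				     sizeof(vm_inode)))
		goto bail;
	if (!mmap_lock_speculation_end(mm, seq))
		goto bail;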
> + vm_pgoff = data_race(vma->vm_pgoff);
> + vm_start = data_race(vma->vm_start);
> +
> + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> + uprobe = find_uprobe_rcu(vm_inode, offset);
> + if (!uprobe)
> + goto bail;
> +
> + /* now double check that nothing about MM changed */
> + if (!mmap_lock_speculation_end(mm, seq))
> + goto bail;
> +
> + rcu_read_unlock();
> +
> + /* happy case, we speculated successfully */
> + return uprobe;
> +bail:
> + rcu_read_unlock();
> + return NULL;
> +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-09 13:12 ` Jann Horn
@ 2024-09-09 21:29 ` Andrii Nakryiko
2024-09-10 15:39 ` Jann Horn
2024-09-10 16:32 ` Suren Baghdasaryan
0 siblings, 2 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 21:29 UTC (permalink / raw)
To: Jann Horn, surenb, Liam Howlett
Cc: Andrii Nakryiko, brauner, linux-trace-kernel, peterz, oleg,
rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
akpm, linux-mm, mjguzik, Miklos Szeredi, Amir Goldstein,
linux-fsdevel
On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
> > access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
>
> No, not every file is SLAB_TYPESAFE_BY_RCU - see for example
> ovl_mmap(), which uses backing_file_mmap(), which does
> vma_set_file(vma, file) where "file" comes from ovl_mmap()'s
> "realfile", which comes from file->private_data, which is set in
> ovl_open() to the return value of ovl_open_realfile(), which comes
> from backing_file_open(), which allocates a file with
> alloc_empty_backing_file(), which uses a normal kzalloc() without any
> RCU stuff, with this comment:
>
> * This is only for kernel internal use, and the allocate file must not be
> * installed into file tables or such.
>
> And when a backing_file is freed, you can see on the path
> __fput() -> file_free()
> that files with FMODE_BACKING are directly freed with kfree(), no RCU delay.
Good catch on FMODE_BACKING, I didn't realize this exception existed, thanks!
I think the way forward would be to detect that the backing file is in
FMODE_BACKING and fall back to the mmap_lock-protected code path.
I guess I have a question for Liam and Suren: do you think it would
be ok to add another bool after `bool detached` in struct
vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try to
add an extra bit into vm_flags_t? The latter would work without
CONFIG_PER_VMA_LOCK, but I don't know what's acceptable to mm folks.
This flag could be set in vma_set_file() when swapping in the backing file
and wherever else vma->vm_file might be set/updated (I need to audit the
code).
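Something like this on the lookup side, where the field name is made up
just to illustrate the idea:

	/*
	 * Hypothetical per-VMA marker (name invented for illustration),
	 * set wherever a FMODE_BACKING file is installed into vma->vm_file:
	 */
	if (READ_ONCE(vma->vm_file_is_backing))
		goto bail; /* fall back to mmap_lock-protected lookup */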
>
> So the RCU-ness of "struct file" is an implementation detail of the
> VFS, and you can't rely on it for ->vm_file unless you get the VFS to
> change how backing file lifetimes work, which might slow down some
> other workload, or you find a way to figure out whether you're dealing
> with a backing file without actually accessing the file.
>
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > + struct mm_struct *mm = current->mm;
> > + struct uprobe *uprobe;
> > + struct vm_area_struct *vma;
> > + struct file *vm_file;
> > + struct inode *vm_inode;
> > + unsigned long vm_pgoff, vm_start;
> > + int seq;
> > + loff_t offset;
> > +
> > + if (!mmap_lock_speculation_start(mm, &seq))
> > + return NULL;
> > +
> > + rcu_read_lock();
> > +
> > + vma = vma_lookup(mm, bp_vaddr);
> > + if (!vma)
> > + goto bail;
> > +
> > + vm_file = data_race(vma->vm_file);
>
> A plain "data_race()" says "I'm fine with this load tearing", but
> you're relying on this load not tearing (since you access the vm_file
> pointer below).
> You're also relying on the "struct file" that vma->vm_file points to
> being populated at this point, which means you need CONSUME semantics
> here, which READ_ONCE() will give you, and something like RELEASE
> semantics on any pairing store that populates vma->vm_file, which
> means they'd all have to become something like smp_store_release()).
vma->vm_file should be set in the VMA before it is installed and never
modified afterwards, isn't that the case? So maybe no extra barriers
are needed and READ_ONCE() would be enough.
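I.e., just something like this (sketch):

	vm_file = READ_ONCE(vma->vm_file); /* single, untorn pointer load */
	if (!vm_file)
		goto bail;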
>
> You might want to instead add another recheck of the sequence count
> (which would involve at least a read memory barrier after the
> preceding patch is fixed) after loading the ->vm_file pointer to
> ensure that no one was concurrently changing the ->vm_file pointer
> before you do memory accesses through it.
>
> > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > + goto bail;
>
> missing data_race() annotation on the vma->vm_flags access
ack
>
> > + vm_inode = data_race(vm_file->f_inode);
>
> As noted above, this doesn't work because you can't rely on having RCU
> lifetime for the file. One *very* ugly hack you could do, if you think
> this code is so performance-sensitive that you're willing to do fairly
> atrocious things here, would be to do a "yes I am intentionally doing
> a UAF read and I know the address might not even be mapped at this
> point, it's fine, trust me" pattern, where you use
> copy_from_kernel_nofault(), kind of like in prepend_copy() in
> fs/d_path.c, and then immediately recheck the sequence count before
> doing *anything* with this vm_inode pointer you just loaded.
>
>
yeah, let's leave it as a very unfortunate plan B and try to solve this
a bit more cleanly.
>
> > + vm_pgoff = data_race(vma->vm_pgoff);
> > + vm_start = data_race(vma->vm_start);
> > +
> > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > + if (!uprobe)
> > + goto bail;
> > +
> > + /* now double check that nothing about MM changed */
> > + if (!mmap_lock_speculation_end(mm, seq))
> > + goto bail;
> > +
> > + rcu_read_unlock();
> > +
> > + /* happy case, we speculated successfully */
> > + return uprobe;
> > +bail:
> > + rcu_read_unlock();
> > + return NULL;
> > +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end}
2024-09-09 12:35 ` Jann Horn
@ 2024-09-10 2:09 ` Suren Baghdasaryan
2024-09-10 15:31 ` Jann Horn
2024-09-11 21:34 ` Andrii Nakryiko
0 siblings, 2 replies; 30+ messages in thread
From: Suren Baghdasaryan @ 2024-09-10 2:09 UTC (permalink / raw)
To: Jann Horn
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm,
linux-mm, mjguzik, brauner
On Mon, Sep 9, 2024 at 5:35 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > +{
> > + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > + return seq == smp_load_acquire(&mm->mm_lock_seq);
> > +}
>
> A load-acquire can't provide "end of locked section" semantics - a
> load-acquire is a one-way barrier, you can basically use it for
> "acquire lock" semantics but not for "release lock" semantics, because
> the CPU will prevent reordering the load with *later* loads but not
> with *earlier* loads. So if you do:
>
> mmap_lock_speculation_start()
> [locked reads go here]
> mmap_lock_speculation_end()
>
> then the CPU is allowed to reorder your instructions like this:
>
> mmap_lock_speculation_start()
> mmap_lock_speculation_end()
> [locked reads go here]
>
> so the lock is broken.
Hi Jann,
Thanks for the review!
Yeah, you are right, we do need an smp_rmb() before we compare
mm->mm_lock_seq with the stored seq.
Otherwise reads might get reordered this way:
CPU1                                          CPU2
mmap_lock_speculation_start() // seq = mm->mm_lock_seq
reloaded_seq = mm->mm_lock_seq; // reordered read

                                              mmap_write_lock() // inc_mm_lock_seq(mm)
                                              vma->vm_file = ...;
                                              mmap_write_unlock() // inc_mm_lock_seq(mm)

<speculate>
mmap_lock_speculation_end() // return (reloaded_seq == seq)
>
> > static inline void mmap_write_lock(struct mm_struct *mm)
> > {
> > __mmap_lock_trace_start_locking(mm, true);
> > down_write(&mm->mmap_lock);
> > + inc_mm_lock_seq(mm);
> > __mmap_lock_trace_acquire_returned(mm, true, true);
> > }
>
> Similarly, inc_mm_lock_seq(), which does a store-release, can only
> provide "release lock" semantics, not "take lock" semantics, because
> the CPU can reorder it with later stores; for example, this code:
>
> inc_mm_lock_seq()
> [locked stuff goes here]
> inc_mm_lock_seq()
>
> can be reordered into this:
>
> [locked stuff goes here]
> inc_mm_lock_seq()
> inc_mm_lock_seq()
>
> so the lock is broken.
Ugh, yes. We do need smp_wmb() AFTER the inc_mm_lock_seq(). Whenever
we use inc_mm_lock_seq() for "take lock" semantics, it's preceded by a
down_write(&mm->mmap_lock) with implied ACQUIRE ordering. So I thought
we could rely on that, but I realize now that this reordering is still
possible:
CPU1                                          CPU2
mmap_write_lock()
down_write(&mm->mmap_lock);
vma->vm_file = ...;

                                              mmap_lock_speculation_start() // seq = mm->mm_lock_seq
                                              <speculate>
                                              mmap_lock_speculation_end() // return (mm->mm_lock_seq == seq)

inc_mm_lock_seq(mm);
mmap_write_unlock() // inc_mm_lock_seq(mm)
Is that what you were describing?
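If so, I think the fix would look roughly like this (just a sketch on top
of this patch):

static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
{
	/*
	 * Pairs with the smp_wmb() on the write side and orders the
	 * speculative reads before the re-read of mm_lock_seq.
	 */
	smp_rmb();
	return seq == READ_ONCE(mm->mm_lock_seq);
}

static inline void mmap_write_lock(struct mm_struct *mm)
{
	__mmap_lock_trace_start_locking(mm, true);
	down_write(&mm->mmap_lock);
	inc_mm_lock_seq(mm);
	/* order the seq bump before any stores done under the lock */
	smp_wmb();
	__mmap_lock_trace_acquire_returned(mm, true, true);
}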
Thanks,
Suren.
>
> For "taking a lock" with a memory store, or "dropping a lock" with a
> memory load, you need heavier memory barriers, see
> Documentation/memory-barriers.txt.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end}
2024-09-10 2:09 ` Suren Baghdasaryan
@ 2024-09-10 15:31 ` Jann Horn
2024-09-11 21:34 ` Andrii Nakryiko
1 sibling, 0 replies; 30+ messages in thread
From: Jann Horn @ 2024-09-10 15:31 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, akpm,
linux-mm, mjguzik, brauner
On Tue, Sep 10, 2024 at 4:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> On Mon, Sep 9, 2024 at 5:35 AM Jann Horn <jannh@google.com> wrote:
> > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > static inline void mmap_write_lock(struct mm_struct *mm)
> > > {
> > > __mmap_lock_trace_start_locking(mm, true);
> > > down_write(&mm->mmap_lock);
> > > + inc_mm_lock_seq(mm);
> > > __mmap_lock_trace_acquire_returned(mm, true, true);
> > > }
> >
> > Similarly, inc_mm_lock_seq(), which does a store-release, can only
> > provide "release lock" semantics, not "take lock" semantics, because
> > the CPU can reorder it with later stores; for example, this code:
> >
> > inc_mm_lock_seq()
> > [locked stuff goes here]
> > inc_mm_lock_seq()
> >
> > can be reordered into this:
> >
> > [locked stuff goes here]
> > inc_mm_lock_seq()
> > inc_mm_lock_seq()
> >
> > so the lock is broken.
>
> Ugh, yes. We do need smp_wmb() AFTER the inc_mm_lock_seq(). Whenever
> we use inc_mm_lock_seq() for "take lock" semantics, it's preceded by a
> down_write(&mm->mmap_lock) with implied ACQUIRE ordering. So I thought
> we can use it but I realize now that this reordering is still
> possible:
> CPU1                                          CPU2
> mmap_write_lock()
> down_write(&mm->mmap_lock);
> vma->vm_file = ...;
>
>                                               mmap_lock_speculation_start() // seq = mm->mm_lock_seq
>                                               <speculate>
>                                               mmap_lock_speculation_end() // return (mm->mm_lock_seq == seq)
>
> inc_mm_lock_seq(mm);
> mmap_write_unlock() // inc_mm_lock_seq(mm)
>
> Is that what you were describing?
Yeah, that's the scenario I was thinking of (though I did not spend
the time to look at the surroundings to see if there are other implied
barriers that happen to stop this).
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-09 21:29 ` Andrii Nakryiko
@ 2024-09-10 15:39 ` Jann Horn
2024-09-10 20:56 ` Andrii Nakryiko
2024-09-10 16:32 ` Suren Baghdasaryan
1 sibling, 1 reply; 30+ messages in thread
From: Jann Horn @ 2024-09-10 15:39 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: surenb, Liam Howlett, Andrii Nakryiko, brauner,
linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
Miklos Szeredi, Amir Goldstein, linux-fsdevel
On Mon, Sep 9, 2024 at 11:29 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
> > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > +{
> > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > + struct mm_struct *mm = current->mm;
> > > + struct uprobe *uprobe;
> > > + struct vm_area_struct *vma;
> > > + struct file *vm_file;
> > > + struct inode *vm_inode;
> > > + unsigned long vm_pgoff, vm_start;
> > > + int seq;
> > > + loff_t offset;
> > > +
> > > + if (!mmap_lock_speculation_start(mm, &seq))
> > > + return NULL;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + vma = vma_lookup(mm, bp_vaddr);
> > > + if (!vma)
> > > + goto bail;
> > > +
> > > + vm_file = data_race(vma->vm_file);
> >
> > A plain "data_race()" says "I'm fine with this load tearing", but
> > you're relying on this load not tearing (since you access the vm_file
> > pointer below).
> > You're also relying on the "struct file" that vma->vm_file points to
> > being populated at this point, which means you need CONSUME semantics
> > here, which READ_ONCE() will give you, and something like RELEASE
> > semantics on any pairing store that populates vma->vm_file, which
> > means they'd all have to become something like smp_store_release()).
>
> vma->vm_file should be set in VMA before it is installed and is never
> modified afterwards, isn't that the case? So maybe no extra barrier
> are needed and READ_ONCE() would be enough.
Ah, right, I'm not sure what I was thinking there.
I... guess you only _really_ need the READ_ONCE() if something can
actually ever change the ->vm_file pointer, otherwise just a plain
load with no annotation whatsoever would be good enough? I'm fairly
sure nothing can ever change the ->vm_file pointer of a live VMA, and
I think _currently_ it looks like nothing will NULL out the ->vm_file
pointer on free either... though that last part is probably not
something you should rely on...
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup
2024-09-06 5:12 [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
2024-09-06 5:12 ` [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
2024-09-06 5:12 ` [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
@ 2024-09-10 16:06 ` Jann Horn
2024-09-10 17:58 ` Andrii Nakryiko
2 siblings, 1 reply; 30+ messages in thread
From: Jann Horn @ 2024-09-10 16:06 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
mjguzik, brauner
On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> Implement speculative (lockless) resolution of VMA to inode to uprobe,
> bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
> adds mm_struct helpers that help detect whether mm_struct were changed, which
> is used by uprobe logic to validate that speculative results can be trusted
> after all the lookup logic results in a valid uprobe instance.
Random thought: It would be nice if you could skip the MM stuff
entirely and instead go through the GUP-fast path, but I guess going
from a uprobe-created anon page to the corresponding uprobe is hard...
but maybe if you used the anon_vma pointer as a lookup key to find the
uprobe, it could work? Though then you'd need hooks in the anon_vma
code... maybe not such a great idea.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-09 21:29 ` Andrii Nakryiko
2024-09-10 15:39 ` Jann Horn
@ 2024-09-10 16:32 ` Suren Baghdasaryan
2024-09-10 20:58 ` Andrii Nakryiko
1 sibling, 1 reply; 30+ messages in thread
From: Suren Baghdasaryan @ 2024-09-10 16:32 UTC (permalink / raw)
To: Andrii Nakryiko, Christian Brauner
Cc: Jann Horn, Liam Howlett, Andrii Nakryiko, linux-trace-kernel,
peterz, oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa,
paulmck, willy, akpm, linux-mm, mjguzik, Miklos Szeredi,
Amir Goldstein, linux-fsdevel
On Mon, Sep 9, 2024 at 2:29 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
> > > access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
> >
> > No, not every file is SLAB_TYPESAFE_BY_RCU - see for example
> > ovl_mmap(), which uses backing_file_mmap(), which does
> > vma_set_file(vma, file) where "file" comes from ovl_mmap()'s
> > "realfile", which comes from file->private_data, which is set in
> > ovl_open() to the return value of ovl_open_realfile(), which comes
> > from backing_file_open(), which allocates a file with
> > alloc_empty_backing_file(), which uses a normal kzalloc() without any
> > RCU stuff, with this comment:
> >
> > * This is only for kernel internal use, and the allocate file must not be
> > * installed into file tables or such.
> >
> > And when a backing_file is freed, you can see on the path
> > __fput() -> file_free()
> > that files with FMODE_BACKING are directly freed with kfree(), no RCU delay.
>
> Good catch on FMODE_BACKING, I didn't realize there is this exception, thanks!
>
> I think the way forward would be to detect that the backing file is in
> FMODE_BACKING and fall back to mmap_lock-protected code path.
>
> I guess I have the question to Liam and Suren, do you think it would
> be ok to add another bool after `bool detached` in struct
> vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try to
> add an extra bit into vm_flags_t? The latter would work without
> CONFIG_PER_VMA_LOCK, but I don't know what's acceptable with mm folks.
>
> This flag can be set in vma_set_file() when swapping backing file and
> wherever else vma->vm_file might be set/updated (I need to audit the
> code).
I understand that this would work but I'm not very eager to leak
vm_file attributes like FMODE_BACKING into vm_area_struct.
Instead maybe that exception can be avoided? Treating all vm_files
equally as RCU-safe would be a much simpler solution. I see that this
exception was introduced in [1] and I don't know if this was done for
performance reasons or something else. Christian, CCing you here;
could you please clarify?
[1] https://lore.kernel.org/all/20231005-sakralbau-wappnen-f5c31755ed70@brauner/
>
> >
> > So the RCU-ness of "struct file" is an implementation detail of the
> > VFS, and you can't rely on it for ->vm_file unless you get the VFS to
> > change how backing file lifetimes work, which might slow down some
> > other workload, or you find a way to figure out whether you're dealing
> > with a backing file without actually accessing the file.
> >
> > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > +{
> > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > + struct mm_struct *mm = current->mm;
> > > + struct uprobe *uprobe;
> > > + struct vm_area_struct *vma;
> > > + struct file *vm_file;
> > > + struct inode *vm_inode;
> > > + unsigned long vm_pgoff, vm_start;
> > > + int seq;
> > > + loff_t offset;
> > > +
> > > + if (!mmap_lock_speculation_start(mm, &seq))
> > > + return NULL;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + vma = vma_lookup(mm, bp_vaddr);
> > > + if (!vma)
> > > + goto bail;
> > > +
> > > + vm_file = data_race(vma->vm_file);
> >
> > A plain "data_race()" says "I'm fine with this load tearing", but
> > you're relying on this load not tearing (since you access the vm_file
> > pointer below).
> > You're also relying on the "struct file" that vma->vm_file points to
> > being populated at this point, which means you need CONSUME semantics
> > here, which READ_ONCE() will give you, and something like RELEASE
> > semantics on any pairing store that populates vma->vm_file, which
> > means they'd all have to become something like smp_store_release()).
>
> vma->vm_file should be set in VMA before it is installed and is never
> modified afterwards, isn't that the case? So maybe no extra barrier
> are needed and READ_ONCE() would be enough.
>
> >
> > You might want to instead add another recheck of the sequence count
> > (which would involve at least a read memory barrier after the
> > preceding patch is fixed) after loading the ->vm_file pointer to
> > ensure that no one was concurrently changing the ->vm_file pointer
> > before you do memory accesses through it.
> >
> > > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > + goto bail;
> >
> > missing data_race() annotation on the vma->vm_flags access
>
> ack
>
> >
> > > + vm_inode = data_race(vm_file->f_inode);
> >
> > As noted above, this doesn't work because you can't rely on having RCU
> > lifetime for the file. One *very* ugly hack you could do, if you think
> > this code is so performance-sensitive that you're willing to do fairly
> > atrocious things here, would be to do a "yes I am intentionally doing
> > a UAF read and I know the address might not even be mapped at this
> > point, it's fine, trust me" pattern, where you use
> > copy_from_kernel_nofault(), kind of like in prepend_copy() in
> > fs/d_path.c, and then immediately recheck the sequence count before
> > doing *anything* with this vm_inode pointer you just loaded.
> >
> >
>
> yeah, let's leave it as a very unfortunate plan B and try to solve it
> a bit cleaner.
>
>
> >
> > > + vm_pgoff = data_race(vma->vm_pgoff);
> > > + vm_start = data_race(vma->vm_start);
> > > +
> > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > > + if (!uprobe)
> > > + goto bail;
> > > +
> > > + /* now double check that nothing about MM changed */
> > > + if (!mmap_lock_speculation_end(mm, seq))
> > > + goto bail;
> > > +
> > > + rcu_read_unlock();
> > > +
> > > + /* happy case, we speculated successfully */
> > > + return uprobe;
> > > +bail:
> > > + rcu_read_unlock();
> > > + return NULL;
> > > +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup
2024-09-10 16:06 ` [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Jann Horn
@ 2024-09-10 17:58 ` Andrii Nakryiko
2024-09-10 18:13 ` Jann Horn
0 siblings, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 17:58 UTC (permalink / raw)
To: Jann Horn
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
linux-mm, mjguzik, brauner
On Tue, Sep 10, 2024 at 9:06 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > Implement speculative (lockless) resolution of VMA to inode to uprobe,
> > bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
> > adds mm_struct helpers that help detect whether mm_struct were changed, which
> > is used by uprobe logic to validate that speculative results can be trusted
> > after all the lookup logic results in a valid uprobe instance.
>
> Random thought: It would be nice if you could skip the MM stuff
> entirely and instead go through the GUP-fast path, but I guess going
> from a uprobe-created anon page to the corresponding uprobe is hard...
> but maybe if you used the anon_vma pointer as a lookup key to find the
> uprobe, it could work? Though then you'd need hooks in the anon_vma
> code... maybe not such a great idea.
I'm not crystal clear on all the details here, so maybe you can
elaborate a bit. But keep in mind that a) there could be multiple
uprobes within a single user page, so the lookup has to take at least
the offset within the page into account somehow; and b) a single uprobe
can be installed in many independent anon VMAs across many processes,
so the anon_vma itself can't be part of the key.
Maybe we could leave some sort of "cookie" stashed somewhere to help
with the lookup, but then again, there can be multiple uprobes per page.
It does feel like lockless VMA-to-inode resolution would be a cleaner
solution; let's see if we can get there somehow.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup
2024-09-10 17:58 ` Andrii Nakryiko
@ 2024-09-10 18:13 ` Jann Horn
0 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2024-09-10 18:13 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm,
linux-mm, mjguzik, brauner
On Tue, Sep 10, 2024 at 7:58 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Tue, Sep 10, 2024 at 9:06 AM Jann Horn <jannh@google.com> wrote:
> > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > Implement speculative (lockless) resolution of VMA to inode to uprobe,
> > > bypassing the need to take mmap_lock for reads, if possible. Patch #1 by Suren
> > > adds mm_struct helpers that help detect whether mm_struct were changed, which
> > > is used by uprobe logic to validate that speculative results can be trusted
> > > after all the lookup logic results in a valid uprobe instance.
> >
> > Random thought: It would be nice if you could skip the MM stuff
> > entirely and instead go through the GUP-fast path, but I guess going
> > from a uprobe-created anon page to the corresponding uprobe is hard...
> > but maybe if you used the anon_vma pointer as a lookup key to find the
> > uprobe, it could work? Though then you'd need hooks in the anon_vma
> > code... maybe not such a great idea.
>
> So I'm not crystal clear on all the details here, so maybe you can
> elaborate a bit. But keep in mind that a) there could be multiple
> uprobes within a single user page, so lookup has to take at least
> offset within the page into account somehow. But also b) single uprobe
I think anonymous pages have the same pgoff numbering as file pages;
so the page's mapping and pgoff pointers together should almost give
you the same amount of information as what you are currently looking
for (the file and the offset inside it), except that you'd get an
anon_vma pointer corresponding to the file instead of directly getting
the file.
> can be installed in many independent anon VMAs across many processes.
> So anon vma itself can't be part of the key.
Yeah, I guess to make that work you'd have to somehow track which
anon_vmas exist for which mappings.
(An anon_vma is tied to one specific file, see anon_vma_compatible().)
> Though maybe we could have left some sort of "cookie" stashed
> somewhere to help with lookup. But then again, multiple uprobes per
> page.
>
> It does feel like lockless VMA to inode resolution would be a cleaner
> solution, let's see if we can get there somehow.
Mh, yes, I was just thinking it would be nice if we could keep this
lockless complexity out of the mmap locking code... but I guess it's
not much more straightforward than what you're doing.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-10 15:39 ` Jann Horn
@ 2024-09-10 20:56 ` Andrii Nakryiko
0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 20:56 UTC (permalink / raw)
To: Jann Horn
Cc: surenb, Liam Howlett, Andrii Nakryiko, brauner,
linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
Miklos Szeredi, Amir Goldstein, linux-fsdevel
On Tue, Sep 10, 2024 at 8:39 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Sep 9, 2024 at 11:29 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
> > > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > +{
> > > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > + struct mm_struct *mm = current->mm;
> > > > + struct uprobe *uprobe;
> > > > + struct vm_area_struct *vma;
> > > > + struct file *vm_file;
> > > > + struct inode *vm_inode;
> > > > + unsigned long vm_pgoff, vm_start;
> > > > + int seq;
> > > > + loff_t offset;
> > > > +
> > > > + if (!mmap_lock_speculation_start(mm, &seq))
> > > > + return NULL;
> > > > +
> > > > + rcu_read_lock();
> > > > +
> > > > + vma = vma_lookup(mm, bp_vaddr);
> > > > + if (!vma)
> > > > + goto bail;
> > > > +
> > > > + vm_file = data_race(vma->vm_file);
> > >
> > > A plain "data_race()" says "I'm fine with this load tearing", but
> > > you're relying on this load not tearing (since you access the vm_file
> > > pointer below).
> > > You're also relying on the "struct file" that vma->vm_file points to
> > > being populated at this point, which means you need CONSUME semantics
> > > here, which READ_ONCE() will give you, and something like RELEASE
> > > semantics on any pairing store that populates vma->vm_file, which
> > > means they'd all have to become something like smp_store_release()).
> >
> > vma->vm_file should be set in VMA before it is installed and is never
> > modified afterwards, isn't that the case? So maybe no extra barrier
> > are needed and READ_ONCE() would be enough.
>
> Ah, right, I'm not sure what I was thinking there.
>
> I... guess you only _really_ need the READ_ONCE() if something can
> actually ever change the ->vm_file pointer, otherwise just a plain
> load with no annotation whatsoever would be good enough? I'm fairly
yep, probably, I was just trying to be cautious :)
> sure nothing can ever change the ->vm_file pointer of a live VMA, and
> I think _currently_ it looks like nothing will NULL out the ->vm_file
> pointer on free either... though that last part is probably not
> something you should rely on...
This seems rather important, but similarly to how vm_file can't be
modified, it seems reasonable to assume that it won't be set to NULL
(setting it to NULL is a modification too, isn't it?). I mean, we can
probably just add a NULL check and rely on the atomicity of setting a
pointer, so it's not a big deal, but it seems like a pretty reasonable
assumption to make.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-10 16:32 ` Suren Baghdasaryan
@ 2024-09-10 20:58 ` Andrii Nakryiko
2024-09-12 11:17 ` Christian Brauner
0 siblings, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 20:58 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Christian Brauner, Jann Horn, Liam Howlett, Andrii Nakryiko,
linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
Miklos Szeredi, Amir Goldstein, linux-fsdevel
On Tue, Sep 10, 2024 at 9:32 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Sep 9, 2024 at 2:29 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
> > > > access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
> > >
> > > No, not every file is SLAB_TYPESAFE_BY_RCU - see for example
> > > ovl_mmap(), which uses backing_file_mmap(), which does
> > > vma_set_file(vma, file) where "file" comes from ovl_mmap()'s
> > > "realfile", which comes from file->private_data, which is set in
> > > ovl_open() to the return value of ovl_open_realfile(), which comes
> > > from backing_file_open(), which allocates a file with
> > > alloc_empty_backing_file(), which uses a normal kzalloc() without any
> > > RCU stuff, with this comment:
> > >
> > > * This is only for kernel internal use, and the allocate file must not be
> > > * installed into file tables or such.
> > >
> > > And when a backing_file is freed, you can see on the path
> > > __fput() -> file_free()
> > > that files with FMODE_BACKING are directly freed with kfree(), no RCU delay.
> >
> > Good catch on FMODE_BACKING, I didn't realize there is this exception, thanks!
> >
> > I think the way forward would be to detect that the backing file is in
> > FMODE_BACKING and fall back to mmap_lock-protected code path.
> >
> > I guess I have the question to Liam and Suren, do you think it would
> > be ok to add another bool after `bool detached` in struct
> > vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try to
> > add an extra bit into vm_flags_t? The latter would work without
> > CONFIG_PER_VMA_LOCK, but I don't know what's acceptable with mm folks.
> >
> > This flag can be set in vma_set_file() when swapping backing file and
> > wherever else vma->vm_file might be set/updated (I need to audit the
> > code).
>
> I understand that this would work but I'm not very eager to leak
> vm_file attributes like FMODE_BACKING into vm_area_struct.
> Instead maybe that exception can be avoided? Treating all vm_files
I agree, that would be best, of course. It seems like [1] was an
optimization to avoid kfree_rcu() calls, not sure how big of a deal it
is to undo that, given we do have a use case that calls for it now.
Let's see what Christian thinks.
> equally as RCU-safe would be a much simpler solution. I see that this
> exception was introduced in [1] and I don't know if this was done for
> performance reasons or something else. Christian, CCing you here to
> please clarify.
>
> [1] https://lore.kernel.org/all/20231005-sakralbau-wappnen-f5c31755ed70@brauner/
>
> >
> > >
> > > So the RCU-ness of "struct file" is an implementation detail of the
> > > VFS, and you can't rely on it for ->vm_file unless you get the VFS to
> > > change how backing file lifetimes work, which might slow down some
> > > other workload, or you find a way to figure out whether you're dealing
> > > with a backing file without actually accessing the file.
> > >
> > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > +{
> > > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > + struct mm_struct *mm = current->mm;
> > > > + struct uprobe *uprobe;
> > > > + struct vm_area_struct *vma;
> > > > + struct file *vm_file;
> > > > + struct inode *vm_inode;
> > > > + unsigned long vm_pgoff, vm_start;
> > > > + int seq;
> > > > + loff_t offset;
> > > > +
> > > > + if (!mmap_lock_speculation_start(mm, &seq))
> > > > + return NULL;
> > > > +
> > > > + rcu_read_lock();
> > > > +
> > > > + vma = vma_lookup(mm, bp_vaddr);
> > > > + if (!vma)
> > > > + goto bail;
> > > > +
> > > > + vm_file = data_race(vma->vm_file);
> > >
> > > A plain "data_race()" says "I'm fine with this load tearing", but
> > > you're relying on this load not tearing (since you access the vm_file
> > > pointer below).
> > > You're also relying on the "struct file" that vma->vm_file points to
> > > being populated at this point, which means you need CONSUME semantics
> > > here, which READ_ONCE() will give you, and something like RELEASE
> > > semantics on any pairing store that populates vma->vm_file, which
> > > means they'd all have to become something like smp_store_release()).
> >
> > vma->vm_file should be set in VMA before it is installed and is never
> > modified afterwards, isn't that the case? So maybe no extra barriers
> > are needed and READ_ONCE() would be enough.
> >
> > >
> > > You might want to instead add another recheck of the sequence count
> > > (which would involve at least a read memory barrier after the
> > > preceding patch is fixed) after loading the ->vm_file pointer to
> > > ensure that no one was concurrently changing the ->vm_file pointer
> > > before you do memory accesses through it.
> > >
> > > > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > + goto bail;
> > >
> > > missing data_race() annotation on the vma->vm_flags access
> >
> > ack
> >
> > >
> > > > + vm_inode = data_race(vm_file->f_inode);
> > >
> > > As noted above, this doesn't work because you can't rely on having RCU
> > > lifetime for the file. One *very* ugly hack you could do, if you think
> > > this code is so performance-sensitive that you're willing to do fairly
> > > atrocious things here, would be to do a "yes I am intentionally doing
> > > a UAF read and I know the address might not even be mapped at this
> > > point, it's fine, trust me" pattern, where you use
> > > copy_from_kernel_nofault(), kind of like in prepend_copy() in
> > > fs/d_path.c, and then immediately recheck the sequence count before
> > > doing *anything* with this vm_inode pointer you just loaded.
> > >
> > >
> >
> > yeah, let's leave it as a very unfortunate plan B and try to solve it
> > a bit cleaner.
> >
> >
> > >
> > > > + vm_pgoff = data_race(vma->vm_pgoff);
> > > > + vm_start = data_race(vma->vm_start);
> > > > +
> > > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > + if (!uprobe)
> > > > + goto bail;
> > > > +
> > > > + /* now double check that nothing about MM changed */
> > > > + if (!mmap_lock_speculation_end(mm, seq))
> > > > + goto bail;
> > > > +
> > > > + rcu_read_unlock();
> > > > +
> > > > + /* happy case, we speculated successfully */
> > > > + return uprobe;
> > > > +bail:
> > > > + rcu_read_unlock();
> > > > + return NULL;
> > > > +}
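To make the fallback idea discussed above concrete, the caller would be
shaped roughly like the sketch below. This is only an illustration:
find_active_uprobe_locked() stands in for the existing
mmap_lock-protected lookup and is not a function from the posted
patches, and the signature merely mirrors the current uprobe code:

static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
{
	struct uprobe *uprobe;

	/* fast path: no mmap_lock, result validated via mm_lock_seq */
	uprobe = find_active_uprobe_speculative(bp_vaddr);
	if (uprobe)
		return uprobe;

	/*
	 * Slow path: speculation bailed (no VMA, a file that is not RCU-safe,
	 * concurrent mmap_lock writer, ...), so take mmap_lock for read and
	 * redo the lookup the old way.
	 */
	return find_active_uprobe_locked(bp_vaddr, is_swbp);
}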
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end}
2024-09-10 2:09 ` Suren Baghdasaryan
2024-09-10 15:31 ` Jann Horn
@ 2024-09-11 21:34 ` Andrii Nakryiko
2024-09-11 21:48 ` Suren Baghdasaryan
1 sibling, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-11 21:34 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Jann Horn, Andrii Nakryiko, linux-trace-kernel, peterz, oleg,
rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
akpm, linux-mm, mjguzik, brauner
On Mon, Sep 9, 2024 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Sep 9, 2024 at 5:35 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > > +{
> > > + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > > + return seq == smp_load_acquire(&mm->mm_lock_seq);
> > > +}
> >
> > A load-acquire can't provide "end of locked section" semantics - a
> > load-acquire is a one-way barrier, you can basically use it for
> > "acquire lock" semantics but not for "release lock" semantics, because
> > the CPU will prevent reordering the load with *later* loads but not
> > with *earlier* loads. So if you do:
> >
> > mmap_lock_speculation_start()
> > [locked reads go here]
> > mmap_lock_speculation_end()
> >
> > then the CPU is allowed to reorder your instructions like this:
> >
> > mmap_lock_speculation_start()
> > mmap_lock_speculation_end()
> > [locked reads go here]
> >
> > so the lock is broken.
>
> Hi Jann,
> Thanks for the review!
> Yeah, you are right, we do need an smp_rmb() before we compare
> mm->mm_lock_seq with the stored seq.
>
> Otherwise reads might get reordered this way:
>
> CPU1                                              CPU2
> mmap_lock_speculation_start() // seq = mm->mm_lock_seq
> reloaded_seq = mm->mm_lock_seq; // reordered read
>                                                    mmap_write_lock() // inc_mm_lock_seq(mm)
>                                                    vma->vm_file = ...;
>                                                    mmap_write_unlock() // inc_mm_lock_seq(mm)
> <speculate>
> mmap_lock_speculation_end() // return (reloaded_seq == seq)
>
> >
> > > static inline void mmap_write_lock(struct mm_struct *mm)
> > > {
> > > __mmap_lock_trace_start_locking(mm, true);
> > > down_write(&mm->mmap_lock);
> > > + inc_mm_lock_seq(mm);
> > > __mmap_lock_trace_acquire_returned(mm, true, true);
> > > }
> >
> > Similarly, inc_mm_lock_seq(), which does a store-release, can only
> > provide "release lock" semantics, not "take lock" semantics, because
> > the CPU can reorder it with later stores; for example, this code:
> >
> > inc_mm_lock_seq()
> > [locked stuff goes here]
> > inc_mm_lock_seq()
> >
> > can be reordered into this:
> >
> > [locked stuff goes here]
> > inc_mm_lock_seq()
> > inc_mm_lock_seq()
> >
> > so the lock is broken.
>
> Ugh, yes. We do need smp_wmb() AFTER the inc_mm_lock_seq(). Whenever
Suren, can you share with me an updated patch for mm_lock_seq with the
right memory barriers? Do you think this might have a noticeable
impact on performance? What sort of benchmark do mm folks use to
quantify changes like that?
> we use inc_mm_lock_seq() for "take lock" semantics, it's preceded by a
> down_write(&mm->mmap_lock) with implied ACQUIRE ordering. So I thought
> we can use it but I realize now that this reordering is still
> possible:
> CPU1                                              CPU2
> mmap_write_lock()
> down_write(&mm->mmap_lock);
> vma->vm_file = ...;
>
>                                                    mmap_lock_speculation_start() // seq = mm->mm_lock_seq
>                                                    <speculate>
>                                                    mmap_lock_speculation_end() // return (mm->mm_lock_seq == seq)
>
> inc_mm_lock_seq(mm);
> mmap_write_unlock() // inc_mm_lock_seq(mm)
>
> Is that what you were describing?
> Thanks,
> Suren.
>
> >
> > For "taking a lock" with a memory store, or "dropping a lock" with a
> > memory load, you need heavier memory barriers, see
> > Documentation/memory-barriers.txt.
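To summarize where the barrier discussion lands (Suren's v2 patch later
in this thread implements it for real), the writer and reader sides pair
up roughly as in the sketch below; the helper names are illustrative,
not the final API:

/* Writer, under exclusive mmap_lock: the "take lock" bump. */
static inline void mm_lock_seq_write_begin(struct mm_struct *mm)
{
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);	/* seq becomes odd */
	smp_wmb();	/* order the bump before the VMA stores that follow */
}

/* Reader: call after all speculative loads, with seq sampled up front. */
static inline bool mm_lock_seq_read_validate(struct mm_struct *mm, int seq)
{
	smp_rmb();	/* order the speculative loads before the recheck */
	return seq == READ_ONCE(mm->mm_lock_seq);
}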
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end}
2024-09-11 21:34 ` Andrii Nakryiko
@ 2024-09-11 21:48 ` Suren Baghdasaryan
2024-09-12 21:02 ` [PATCH v2 1/1] " Suren Baghdasaryan
0 siblings, 1 reply; 30+ messages in thread
From: Suren Baghdasaryan @ 2024-09-11 21:48 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jann Horn, Andrii Nakryiko, linux-trace-kernel, peterz, oleg,
rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
akpm, linux-mm, mjguzik, brauner
On Wed, Sep 11, 2024 at 2:35 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Sep 9, 2024 at 5:35 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > > > +{
> > > > + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > > > + return seq == smp_load_acquire(&mm->mm_lock_seq);
> > > > +}
> > >
> > > A load-acquire can't provide "end of locked section" semantics - a
> > > load-acquire is a one-way barrier, you can basically use it for
> > > "acquire lock" semantics but not for "release lock" semantics, because
> > > the CPU will prevent reordering the load with *later* loads but not
> > > with *earlier* loads. So if you do:
> > >
> > > mmap_lock_speculation_start()
> > > [locked reads go here]
> > > mmap_lock_speculation_end()
> > >
> > > then the CPU is allowed to reorder your instructions like this:
> > >
> > > mmap_lock_speculation_start()
> > > mmap_lock_speculation_end()
> > > [locked reads go here]
> > >
> > > so the lock is broken.
> >
> > Hi Jann,
> > Thanks for the review!
> > Yeah, you are right, we do need an smp_rmb() before we compare
> > mm->mm_lock_seq with the stored seq.
> >
> > Otherwise reads might get reordered this way:
> >
> > CPU1                                              CPU2
> > mmap_lock_speculation_start() // seq = mm->mm_lock_seq
> > reloaded_seq = mm->mm_lock_seq; // reordered read
> >                                                    mmap_write_lock() // inc_mm_lock_seq(mm)
> >                                                    vma->vm_file = ...;
> >                                                    mmap_write_unlock() // inc_mm_lock_seq(mm)
> > <speculate>
> > mmap_lock_speculation_end() // return (reloaded_seq == seq)
> >
> > >
> > > > static inline void mmap_write_lock(struct mm_struct *mm)
> > > > {
> > > > __mmap_lock_trace_start_locking(mm, true);
> > > > down_write(&mm->mmap_lock);
> > > > + inc_mm_lock_seq(mm);
> > > > __mmap_lock_trace_acquire_returned(mm, true, true);
> > > > }
> > >
> > > Similarly, inc_mm_lock_seq(), which does a store-release, can only
> > > provide "release lock" semantics, not "take lock" semantics, because
> > > the CPU can reorder it with later stores; for example, this code:
> > >
> > > inc_mm_lock_seq()
> > > [locked stuff goes here]
> > > inc_mm_lock_seq()
> > >
> > > can be reordered into this:
> > >
> > > [locked stuff goes here]
> > > inc_mm_lock_seq()
> > > inc_mm_lock_seq()
> > >
> > > so the lock is broken.
> >
> > Ugh, yes. We do need smp_wmb() AFTER the inc_mm_lock_seq(). Whenever
>
> Suren, can you share with me an updated patch for mm_lock_seq with the
> right memory barriers? Do you think this might have a noticeable
> impact on performance? What sort of benchmark do mm folks use to
> quantify changes like that?
Yes, I think I can get it to you before leaving for LPC.
It might end up affecting paths where we take mmap_lock for write
(mmap/munmap/mprotect/etc) but these are not considered fast paths.
I'll think about possible tests we can run to evaluate it.
>
> > we use inc_mm_lock_seq() for "take lock" semantics, it's preceded by a
> > down_write(&mm->mmap_lock) with implied ACQUIRE ordering. So I thought
> > we can use it but I realize now that this reordering is still
> > possible:
> > CPU1                                              CPU2
> > mmap_write_lock()
> > down_write(&mm->mmap_lock);
> > vma->vm_file = ...;
> >
> >                                                    mmap_lock_speculation_start() // seq = mm->mm_lock_seq
> >                                                    <speculate>
> >                                                    mmap_lock_speculation_end() // return (mm->mm_lock_seq == seq)
> >
> > inc_mm_lock_seq(mm);
> > mmap_write_unlock() // inc_mm_lock_seq(mm)
> >
> > Is that what you were describing?
> > Thanks,
> > Suren.
> >
> > >
> > > For "taking a lock" with a memory store, or "dropping a lock" with a
> > > memory load, you need heavier memory barriers, see
> > > Documentation/memory-barriers.txt.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-10 20:58 ` Andrii Nakryiko
@ 2024-09-12 11:17 ` Christian Brauner
2024-09-12 17:54 ` Andrii Nakryiko
0 siblings, 1 reply; 30+ messages in thread
From: Christian Brauner @ 2024-09-12 11:17 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, Jann Horn, Liam Howlett, Andrii Nakryiko,
linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
Miklos Szeredi, Amir Goldstein, linux-fsdevel
On Tue, Sep 10, 2024 at 01:58:10PM GMT, Andrii Nakryiko wrote:
> On Tue, Sep 10, 2024 at 9:32 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Sep 9, 2024 at 2:29 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
> > > > > access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
> > > >
> > > > No, not every file is SLAB_TYPESAFE_BY_RCU - see for example
> > > > ovl_mmap(), which uses backing_file_mmap(), which does
> > > > vma_set_file(vma, file) where "file" comes from ovl_mmap()'s
> > > > "realfile", which comes from file->private_data, which is set in
> > > > ovl_open() to the return value of ovl_open_realfile(), which comes
> > > > from backing_file_open(), which allocates a file with
> > > > alloc_empty_backing_file(), which uses a normal kzalloc() without any
> > > > RCU stuff, with this comment:
> > > >
> > > > * This is only for kernel internal use, and the allocate file must not be
> > > > * installed into file tables or such.
> > > >
> > > > And when a backing_file is freed, you can see on the path
> > > > __fput() -> file_free()
> > > > that files with FMODE_BACKING are directly freed with kfree(), no RCU delay.
> > >
> > > Good catch on FMODE_BACKING, I didn't realize there is this exception, thanks!
> > >
> > > I think the way forward would be to detect that the backing file is in
> > > FMODE_BACKING and fall back to mmap_lock-protected code path.
> > >
> > > I guess I have the question to Liam and Suren, do you think it would
> > > be ok to add another bool after `bool detached` in struct
> > > vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try to
> > > add an extra bit into vm_flags_t? The latter would work without
> > > CONFIG_PER_VMA_LOCK, but I don't know what's acceptable with mm folks.
> > >
> > > This flag can be set in vma_set_file() when swapping backing file and
> > > wherever else vma->vm_file might be set/updated (I need to audit the
> > > code).
> >
> > I understand that this would work but I'm not very eager to leak
> > vm_file attributes like FMODE_BACKING into vm_area_struct.
> > Instead maybe that exception can be avoided? Treating all vm_files
>
> I agree, that would be best, of course. It seems like [1] was an
> optimization to avoid kfree_rcu() calls, not sure how big of a deal it
> is to undo that, given we do have a use case that calls for it now.
> Let's see what Christian thinks.
Do you just mean?
diff --git a/fs/file_table.c b/fs/file_table.c
index 7ce4d5dac080..03e58b28e539 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -68,7 +68,7 @@ static inline void file_free(struct file *f)
put_cred(f->f_cred);
if (unlikely(f->f_mode & FMODE_BACKING)) {
path_put(backing_file_user_path(f));
- kfree(backing_file(f));
+ kfree_rcu(backing_file(f));
} else {
kmem_cache_free(filp_cachep, f);
}
Then the only thing you can do with FMODE_BACKING is to skip it. I think
that should be fine since backing files right now are only used by
overlayfs and I don't think the kfree_rcu() will be a performance issue.
>
> > equally as RCU-safe would be a much simpler solution. I see that this
> > exception was introduced in [1] and I don't know if this was done for
> > performance reasons or something else. Christian, CCing you here to
> > please clarify.
> >
> > [1] https://lore.kernel.org/all/20231005-sakralbau-wappnen-f5c31755ed70@brauner/
> >
> > >
> > > >
> > > > So the RCU-ness of "struct file" is an implementation detail of the
> > > > VFS, and you can't rely on it for ->vm_file unless you get the VFS to
> > > > change how backing file lifetimes work, which might slow down some
> > > > other workload, or you find a way to figure out whether you're dealing
> > > > with a backing file without actually accessing the file.
> > > >
> > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > > +{
> > > > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > > + struct mm_struct *mm = current->mm;
> > > > > + struct uprobe *uprobe;
> > > > > + struct vm_area_struct *vma;
> > > > > + struct file *vm_file;
> > > > > + struct inode *vm_inode;
> > > > > + unsigned long vm_pgoff, vm_start;
> > > > > + int seq;
> > > > > + loff_t offset;
> > > > > +
> > > > > + if (!mmap_lock_speculation_start(mm, &seq))
> > > > > + return NULL;
> > > > > +
> > > > > + rcu_read_lock();
> > > > > +
> > > > > + vma = vma_lookup(mm, bp_vaddr);
> > > > > + if (!vma)
> > > > > + goto bail;
> > > > > +
> > > > > + vm_file = data_race(vma->vm_file);
> > > >
> > > > A plain "data_race()" says "I'm fine with this load tearing", but
> > > > you're relying on this load not tearing (since you access the vm_file
> > > > pointer below).
> > > > You're also relying on the "struct file" that vma->vm_file points to
> > > > being populated at this point, which means you need CONSUME semantics
> > > > here, which READ_ONCE() will give you, and something like RELEASE
> > > > semantics on any pairing store that populates vma->vm_file, which
> > > > means they'd all have to become something like smp_store_release()).
> > >
> > > vma->vm_file should be set in VMA before it is installed and is never
> > > modified afterwards, isn't that the case? So maybe no extra barriers
> > > are needed and READ_ONCE() would be enough.
> > >
> > > >
> > > > You might want to instead add another recheck of the sequence count
> > > > (which would involve at least a read memory barrier after the
> > > > preceding patch is fixed) after loading the ->vm_file pointer to
> > > > ensure that no one was concurrently changing the ->vm_file pointer
> > > > before you do memory accesses through it.
> > > >
> > > > > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > > + goto bail;
> > > >
> > > > missing data_race() annotation on the vma->vm_flags access
> > >
> > > ack
> > >
> > > >
> > > > > + vm_inode = data_race(vm_file->f_inode);
> > > >
> > > > As noted above, this doesn't work because you can't rely on having RCU
> > > > lifetime for the file. One *very* ugly hack you could do, if you think
> > > > this code is so performance-sensitive that you're willing to do fairly
> > > > atrocious things here, would be to do a "yes I am intentionally doing
> > > > a UAF read and I know the address might not even be mapped at this
> > > > point, it's fine, trust me" pattern, where you use
> > > > copy_from_kernel_nofault(), kind of like in prepend_copy() in
> > > > fs/d_path.c, and then immediately recheck the sequence count before
> > > > doing *anything* with this vm_inode pointer you just loaded.
> > > >
> > > >
> > >
> > > yeah, let's leave it as a very unfortunate plan B and try to solve it
> > > a bit cleaner.
> > >
> > >
> > > >
> > > > > + vm_pgoff = data_race(vma->vm_pgoff);
> > > > > + vm_start = data_race(vma->vm_start);
> > > > > +
> > > > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > > + if (!uprobe)
> > > > > + goto bail;
> > > > > +
> > > > > + /* now double check that nothing about MM changed */
> > > > > + if (!mmap_lock_speculation_end(mm, seq))
> > > > > + goto bail;
> > > > > +
> > > > > + rcu_read_unlock();
> > > > > +
> > > > > + /* happy case, we speculated successfully */
> > > > > + return uprobe;
> > > > > +bail:
> > > > > + rcu_read_unlock();
> > > > > + return NULL;
> > > > > +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-12 11:17 ` Christian Brauner
@ 2024-09-12 17:54 ` Andrii Nakryiko
0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-12 17:54 UTC (permalink / raw)
To: Christian Brauner
Cc: Suren Baghdasaryan, Jann Horn, Liam Howlett, Andrii Nakryiko,
linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
Miklos Szeredi, Amir Goldstein, linux-fsdevel
On Thu, Sep 12, 2024 at 4:17 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Sep 10, 2024 at 01:58:10PM GMT, Andrii Nakryiko wrote:
> > On Tue, Sep 10, 2024 at 9:32 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Sep 9, 2024 at 2:29 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Mon, Sep 9, 2024 at 6:13 AM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Fri, Sep 6, 2024 at 7:12 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > > Given filp_cachep is already marked SLAB_TYPESAFE_BY_RCU, we can safely
> > > > > > access vma->vm_file->f_inode field locklessly under just rcu_read_lock()
> > > > >
> > > > > No, not every file is SLAB_TYPESAFE_BY_RCU - see for example
> > > > > ovl_mmap(), which uses backing_file_mmap(), which does
> > > > > vma_set_file(vma, file) where "file" comes from ovl_mmap()'s
> > > > > "realfile", which comes from file->private_data, which is set in
> > > > > ovl_open() to the return value of ovl_open_realfile(), which comes
> > > > > from backing_file_open(), which allocates a file with
> > > > > alloc_empty_backing_file(), which uses a normal kzalloc() without any
> > > > > RCU stuff, with this comment:
> > > > >
> > > > > * This is only for kernel internal use, and the allocate file must not be
> > > > > * installed into file tables or such.
> > > > >
> > > > > And when a backing_file is freed, you can see on the path
> > > > > __fput() -> file_free()
> > > > > that files with FMODE_BACKING are directly freed with kfree(), no RCU delay.
> > > >
> > > > Good catch on FMODE_BACKING, I didn't realize there is this exception, thanks!
> > > >
> > > > I think the way forward would be to detect that the backing file is in
> > > > FMODE_BACKING and fall back to mmap_lock-protected code path.
> > > >
> > > > I guess I have the question to Liam and Suren, do you think it would
> > > > be ok to add another bool after `bool detached` in struct
> > > > vm_area_struct (guarded by CONFIG_PER_VMA_LOCK), or should we try to
> > > > add an extra bit into vm_flags_t? The latter would work without
> > > > CONFIG_PER_VMA_LOCK, but I don't know what's acceptable with mm folks.
> > > >
> > > > This flag can be set in vma_set_file() when swapping backing file and
> > > > wherever else vma->vm_file might be set/updated (I need to audit the
> > > > code).
> > >
> > > I understand that this would work but I'm not very eager to leak
> > > vm_file attributes like FMODE_BACKING into vm_area_struct.
> > > Instead maybe that exception can be avoided? Treating all vm_files
> >
> > I agree, that would be best, of course. It seems like [1] was an
> > optimization to avoid kfree_rcu() calls, not sure how big of a deal it
> > is to undo that, given we do have a use case that calls for it now.
> > Let's see what Christian thinks.
>
> Do you just mean?
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 7ce4d5dac080..03e58b28e539 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -68,7 +68,7 @@ static inline void file_free(struct file *f)
> put_cred(f->f_cred);
> if (unlikely(f->f_mode & FMODE_BACKING)) {
> path_put(backing_file_user_path(f));
> - kfree(backing_file(f));
> + kfree_rcu(backing_file(f));
> } else {
> kmem_cache_free(filp_cachep, f);
> }
>
> Then the only thing you can do with FMODE_BACKING is to skip it. I think
> that should be fine since backing files right now are only used by
> overlayfs and I don't think the kfree_rcu() will be a performance issue.
Yes, something along those lines. Ok, great, if it's ok to add back
kfree_rcu(), then I think that resolves the main problem I was running
into. I'll incorporate adding back RCU-delayed freeing as a separate
patch in the future patch set, thanks!
>
> >
> > > equally as RCU-safe would be a much simpler solution. I see that this
> > > exception was introduced in [1] and I don't know if this was done for
> > > performance reasons or something else. Christian, CCing you here to
> > > please clarify.
> > >
> > > [1] https://lore.kernel.org/all/20231005-sakralbau-wappnen-f5c31755ed70@brauner/
> > >
> > > >
> > > > >
> > > > > So the RCU-ness of "struct file" is an implementation detail of the
> > > > > VFS, and you can't rely on it for ->vm_file unless you get the VFS to
> > > > > change how backing file lifetimes work, which might slow down some
> > > > > other workload, or you find a way to figure out whether you're dealing
> > > > > with a backing file without actually accessing the file.
> > > > >
> > > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > > > +{
> > > > > > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > > > + struct mm_struct *mm = current->mm;
> > > > > > + struct uprobe *uprobe;
> > > > > > + struct vm_area_struct *vma;
> > > > > > + struct file *vm_file;
> > > > > > + struct inode *vm_inode;
> > > > > > + unsigned long vm_pgoff, vm_start;
> > > > > > + int seq;
> > > > > > + loff_t offset;
> > > > > > +
> > > > > > + if (!mmap_lock_speculation_start(mm, &seq))
> > > > > > + return NULL;
> > > > > > +
> > > > > > + rcu_read_lock();
> > > > > > +
> > > > > > + vma = vma_lookup(mm, bp_vaddr);
> > > > > > + if (!vma)
> > > > > > + goto bail;
> > > > > > +
> > > > > > + vm_file = data_race(vma->vm_file);
> > > > >
> > > > > A plain "data_race()" says "I'm fine with this load tearing", but
> > > > > you're relying on this load not tearing (since you access the vm_file
> > > > > pointer below).
> > > > > You're also relying on the "struct file" that vma->vm_file points to
> > > > > being populated at this point, which means you need CONSUME semantics
> > > > > here, which READ_ONCE() will give you, and something like RELEASE
> > > > > semantics on any pairing store that populates vma->vm_file, which
> > > > > means they'd all have to become something like smp_store_release()).
> > > >
> > > > vma->vm_file should be set in VMA before it is installed and is never
> > > > modified afterwards, isn't that the case? So maybe no extra barriers
> > > > are needed and READ_ONCE() would be enough.
> > > >
> > > > >
> > > > > You might want to instead add another recheck of the sequence count
> > > > > (which would involve at least a read memory barrier after the
> > > > > preceding patch is fixed) after loading the ->vm_file pointer to
> > > > > ensure that no one was concurrently changing the ->vm_file pointer
> > > > > before you do memory accesses through it.
> > > > >
> > > > > > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > > > + goto bail;
> > > > >
> > > > > missing data_race() annotation on the vma->vm_flags access
> > > >
> > > > ack
> > > >
> > > > >
> > > > > > + vm_inode = data_race(vm_file->f_inode);
> > > > >
> > > > > As noted above, this doesn't work because you can't rely on having RCU
> > > > > lifetime for the file. One *very* ugly hack you could do, if you think
> > > > > this code is so performance-sensitive that you're willing to do fairly
> > > > > atrocious things here, would be to do a "yes I am intentionally doing
> > > > > a UAF read and I know the address might not even be mapped at this
> > > > > point, it's fine, trust me" pattern, where you use
> > > > > copy_from_kernel_nofault(), kind of like in prepend_copy() in
> > > > > fs/d_path.c, and then immediately recheck the sequence count before
> > > > > doing *anything* with this vm_inode pointer you just loaded.
> > > > >
> > > > >
> > > >
> > > > yeah, let's leave it as a very unfortunate plan B and try to solve it
> > > > a bit cleaner.
> > > >
> > > >
> > > > >
> > > > > > + vm_pgoff = data_race(vma->vm_pgoff);
> > > > > > + vm_start = data_race(vma->vm_start);
> > > > > > +
> > > > > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > > > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > > > + if (!uprobe)
> > > > > > + goto bail;
> > > > > > +
> > > > > > + /* now double check that nothing about MM changed */
> > > > > > + if (!mmap_lock_speculation_end(mm, seq))
> > > > > > + goto bail;
> > > > > > +
> > > > > > + rcu_read_unlock();
> > > > > > +
> > > > > > + /* happy case, we speculated successfully */
> > > > > > + return uprobe;
> > > > > > +bail:
> > > > > > + rcu_read_unlock();
> > > > > > + return NULL;
> > > > > > +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-11 21:48 ` Suren Baghdasaryan
@ 2024-09-12 21:02 ` Suren Baghdasaryan
2024-09-12 21:04 ` Suren Baghdasaryan
2024-09-12 22:52 ` Jann Horn
0 siblings, 2 replies; 30+ messages in thread
From: Suren Baghdasaryan @ 2024-09-12 21:02 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
surenb, akpm, linux-mm, mjguzik, brauner, jannh, andrii
Add helper functions to speculatively perform operations without
read-locking mmap_lock, expecting that mmap_lock will not be
write-locked and mm is not modified from under us.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
Changes since v1 [1]:
- Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
more strict, per Jann Horn
[1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
include/linux/mm_types.h | 3 ++
include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
kernel/fork.c | 3 --
3 files changed, 65 insertions(+), 15 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..5d8cdebd42bc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -887,6 +887,9 @@ struct mm_struct {
* Roughly speaking, incrementing the sequence number is
* equivalent to releasing locks on VMAs; reading the sequence
* number can be part of taking a read lock on a VMA.
+ * Incremented every time mmap_lock is write-locked/unlocked.
+ * Initialized to 0, therefore odd values indicate mmap_lock
+ * is write-locked and even values that it's released.
*
* Can be modified under write mmap_lock using RELEASE
* semantics.
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index de9dc20b01ba..a281519d0c12 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
}
#ifdef CONFIG_PER_VMA_LOCK
+static inline void init_mm_lock_seq(struct mm_struct *mm)
+{
+ mm->mm_lock_seq = 0;
+}
+
/*
- * Drop all currently-held per-VMA locks.
- * This is called from the mmap_lock implementation directly before releasing
- * a write-locked mmap_lock (or downgrading it to read-locked).
- * This should normally NOT be called manually from other places.
- * If you want to call this manually anyway, keep in mind that this will release
- * *all* VMA write locks, including ones from further up the stack.
+ * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
+ * or write-unlocked (RELEASE semantics).
*/
-static inline void vma_end_write_all(struct mm_struct *mm)
+static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
{
mmap_assert_write_locked(mm);
/*
* Nobody can concurrently modify mm->mm_lock_seq due to exclusive
* mmap_lock being held.
- * We need RELEASE semantics here to ensure that preceding stores into
- * the VMA take effect before we unlock it with this store.
- * Pairs with ACQUIRE semantics in vma_start_read().
*/
- smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+
+ if (acquire) {
+ WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+ /*
+ * For ACQUIRE semantics we should ensure no following stores are
+ * reordered to appear before the mm->mm_lock_seq modification.
+ */
+ smp_wmb();
+ } else {
+ /*
+ * We need RELEASE semantics here to ensure that preceding stores
+ * into the VMA take effect before we unlock it with this store.
+ * Pairs with ACQUIRE semantics in vma_start_read().
+ */
+ smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
+ }
+}
+
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
+{
+ /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
+ *seq = smp_load_acquire(&mm->mm_lock_seq);
+ /* Allow speculation if mmap_lock is not write-locked */
+ return (*seq & 1) == 0;
+}
+
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
+{
+ /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
+ smp_rmb();
+ return seq == READ_ONCE(mm->mm_lock_seq);
}
+
#else
-static inline void vma_end_write_all(struct mm_struct *mm) {}
+static inline void init_mm_lock_seq(struct mm_struct *mm) {}
+static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
+static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
+static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
#endif
+/*
+ * Drop all currently-held per-VMA locks.
+ * This is called from the mmap_lock implementation directly before releasing
+ * a write-locked mmap_lock (or downgrading it to read-locked).
+ * This should normally NOT be called manually from other places.
+ * If you want to call this manually anyway, keep in mind that this will release
+ * *all* VMA write locks, including ones from further up the stack.
+ */
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+ inc_mm_lock_seq(mm, false);
+}
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
+ init_mm_lock_seq(mm);
}
static inline void mmap_write_lock(struct mm_struct *mm)
{
__mmap_lock_trace_start_locking(mm, true);
down_write(&mm->mmap_lock);
+ inc_mm_lock_seq(mm, true);
__mmap_lock_trace_acquire_returned(mm, true, true);
}
@@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
{
__mmap_lock_trace_start_locking(mm, true);
down_write_nested(&mm->mmap_lock, subclass);
+ inc_mm_lock_seq(mm, true);
__mmap_lock_trace_acquire_returned(mm, true, true);
}
@@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
__mmap_lock_trace_start_locking(mm, true);
ret = down_write_killable(&mm->mmap_lock);
+ if (!ret)
+ inc_mm_lock_seq(mm, true);
__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
return ret;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 61070248a7d3..c86e87ed172b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
seqcount_init(&mm->write_protect_seq);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
-#ifdef CONFIG_PER_VMA_LOCK
- mm->mm_lock_seq = 0;
-#endif
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
mm->locked_vm = 0;
base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
--
2.46.0.662.g92d0881bb0-goog
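For reference, the intended reader-side usage of these helpers (mirroring
the uprobe patch quoted earlier in the thread) looks roughly like the
sketch below; the function itself is a made-up example, not part of the
patch:

static bool get_vma_start_speculative(struct mm_struct *mm, unsigned long addr,
				      unsigned long *vm_start)
{
	struct vm_area_struct *vma;
	int seq;

	/* bail right away if mmap_lock is currently write-locked */
	if (!mmap_lock_speculation_start(mm, &seq))
		return false;

	rcu_read_lock();
	vma = vma_lookup(mm, addr);
	if (!vma)
		goto bail;

	*vm_start = READ_ONCE(vma->vm_start);

	/* only trust the value if no writer touched mmap_lock in the meantime */
	if (!mmap_lock_speculation_end(mm, seq))
		goto bail;

	rcu_read_unlock();
	return true;
bail:
	rcu_read_unlock();
	return false;
}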
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-12 21:02 ` [PATCH v2 1/1] " Suren Baghdasaryan
@ 2024-09-12 21:04 ` Suren Baghdasaryan
2024-09-12 22:19 ` Andrii Nakryiko
2024-09-12 22:52 ` Jann Horn
1 sibling, 1 reply; 30+ messages in thread
From: Suren Baghdasaryan @ 2024-09-12 21:04 UTC (permalink / raw)
To: linux-trace-kernel, peterz, oleg
Cc: rostedt, mhiramat, bpf, linux-kernel, jolsa, paulmck, willy,
akpm, linux-mm, mjguzik, brauner, jannh, andrii
On Thu, Sep 12, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
Here you go. I hope I got the ordering right this time around, but I
would feel much better if Jann reviewed it before it's included in
your next patchset :)
Thanks,
Suren.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> Changes since v1 [1]:
> - Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
> more strict, per Jann Horn
>
> [1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
>
> include/linux/mm_types.h | 3 ++
> include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
> kernel/fork.c | 3 --
> 3 files changed, 65 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..5d8cdebd42bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -887,6 +887,9 @@ struct mm_struct {
> * Roughly speaking, incrementing the sequence number is
> * equivalent to releasing locks on VMAs; reading the sequence
> * number can be part of taking a read lock on a VMA.
> + * Incremented every time mmap_lock is write-locked/unlocked.
> + * Initialized to 0, therefore odd values indicate mmap_lock
> + * is write-locked and even values that it's released.
> *
> * Can be modified under write mmap_lock using RELEASE
> * semantics.
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index de9dc20b01ba..a281519d0c12 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> }
>
> #ifdef CONFIG_PER_VMA_LOCK
> +static inline void init_mm_lock_seq(struct mm_struct *mm)
> +{
> + mm->mm_lock_seq = 0;
> +}
> +
> /*
> - * Drop all currently-held per-VMA locks.
> - * This is called from the mmap_lock implementation directly before releasing
> - * a write-locked mmap_lock (or downgrading it to read-locked).
> - * This should normally NOT be called manually from other places.
> - * If you want to call this manually anyway, keep in mind that this will release
> - * *all* VMA write locks, including ones from further up the stack.
> + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> + * or write-unlocked (RELEASE semantics).
> */
> -static inline void vma_end_write_all(struct mm_struct *mm)
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> {
> mmap_assert_write_locked(mm);
> /*
> * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> * mmap_lock being held.
> - * We need RELEASE semantics here to ensure that preceding stores into
> - * the VMA take effect before we unlock it with this store.
> - * Pairs with ACQUIRE semantics in vma_start_read().
> */
> - smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +
> + if (acquire) {
> + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> + /*
> + * For ACQUIRE semantics we should ensure no following stores are
> + * reordered to appear before the mm->mm_lock_seq modification.
> + */
> + smp_wmb();
> + } else {
> + /*
> + * We need RELEASE semantics here to ensure that preceding stores
> + * into the VMA take effect before we unlock it with this store.
> + * Pairs with ACQUIRE semantics in vma_start_read().
> + */
> + smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> + }
> +}
> +
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> +{
> + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> + *seq = smp_load_acquire(&mm->mm_lock_seq);
> + /* Allow speculation if mmap_lock is not write-locked */
> + return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> + /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> + smp_rmb();
> + return seq == READ_ONCE(mm->mm_lock_seq);
> }
> +
> #else
> -static inline void vma_end_write_all(struct mm_struct *mm) {}
> +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> #endif
>
> +/*
> + * Drop all currently-held per-VMA locks.
> + * This is called from the mmap_lock implementation directly before releasing
> + * a write-locked mmap_lock (or downgrading it to read-locked).
> + * This should normally NOT be called manually from other places.
> + * If you want to call this manually anyway, keep in mind that this will release
> + * *all* VMA write locks, including ones from further up the stack.
> + */
> +static inline void vma_end_write_all(struct mm_struct *mm)
> +{
> + inc_mm_lock_seq(mm, false);
> +}
> +
> static inline void mmap_init_lock(struct mm_struct *mm)
> {
> init_rwsem(&mm->mmap_lock);
> + init_mm_lock_seq(mm);
> }
>
> static inline void mmap_write_lock(struct mm_struct *mm)
> {
> __mmap_lock_trace_start_locking(mm, true);
> down_write(&mm->mmap_lock);
> + inc_mm_lock_seq(mm, true);
> __mmap_lock_trace_acquire_returned(mm, true, true);
> }
>
> @@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> {
> __mmap_lock_trace_start_locking(mm, true);
> down_write_nested(&mm->mmap_lock, subclass);
> + inc_mm_lock_seq(mm, true);
> __mmap_lock_trace_acquire_returned(mm, true, true);
> }
>
> @@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
>
> __mmap_lock_trace_start_locking(mm, true);
> ret = down_write_killable(&mm->mmap_lock);
> + if (!ret)
> + inc_mm_lock_seq(mm, true);
> __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> return ret;
> }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 61070248a7d3..c86e87ed172b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> seqcount_init(&mm->write_protect_seq);
> mmap_init_lock(mm);
> INIT_LIST_HEAD(&mm->mmlist);
> -#ifdef CONFIG_PER_VMA_LOCK
> - mm->mm_lock_seq = 0;
> -#endif
> mm_pgtables_bytes_init(mm);
> mm->map_count = 0;
> mm->locked_vm = 0;
>
> base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
> --
> 2.46.0.662.g92d0881bb0-goog
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-12 21:04 ` Suren Baghdasaryan
@ 2024-09-12 22:19 ` Andrii Nakryiko
2024-09-12 22:24 ` Suren Baghdasaryan
0 siblings, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-12 22:19 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
brauner, jannh, andrii
On Thu, Sep 12, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Sep 12, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Add helper functions to speculatively perform operations without
> > read-locking mmap_lock, expecting that mmap_lock will not be
> > write-locked and mm is not modified from under us.
>
> Here you go. I hope I got the ordering right this time around, but I
> would feel much better if Jann reviewed it before it's included in
> your next patchset :)
Thanks a lot! And yes, I'll give it a bit of time for reviews before
sending a new revision.
Did you by any chance get any new ideas for possible benchmarks
(beyond what I did with will-it-scale)?
> Thanks,
> Suren.
>
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > Changes since v1 [1]:
> > - Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
> > more strict, per Jann Horn
> >
> > [1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
> >
> > include/linux/mm_types.h | 3 ++
> > include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
> > kernel/fork.c | 3 --
> > 3 files changed, 65 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6e3bdf8e38bc..5d8cdebd42bc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -887,6 +887,9 @@ struct mm_struct {
> > * Roughly speaking, incrementing the sequence number is
> > * equivalent to releasing locks on VMAs; reading the sequence
> > * number can be part of taking a read lock on a VMA.
> > + * Incremented every time mmap_lock is write-locked/unlocked.
> > + * Initialized to 0, therefore odd values indicate mmap_lock
> > + * is write-locked and even values that it's released.
> > *
> > * Can be modified under write mmap_lock using RELEASE
> > * semantics.
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index de9dc20b01ba..a281519d0c12 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> > }
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > +{
> > + mm->mm_lock_seq = 0;
> > +}
> > +
> > /*
> > - * Drop all currently-held per-VMA locks.
> > - * This is called from the mmap_lock implementation directly before releasing
> > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > - * This should normally NOT be called manually from other places.
> > - * If you want to call this manually anyway, keep in mind that this will release
> > - * *all* VMA write locks, including ones from further up the stack.
> > + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> > + * or write-unlocked (RELEASE semantics).
> > */
> > -static inline void vma_end_write_all(struct mm_struct *mm)
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> > {
> > mmap_assert_write_locked(mm);
> > /*
> > * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> > * mmap_lock being held.
> > - * We need RELEASE semantics here to ensure that preceding stores into
> > - * the VMA take effect before we unlock it with this store.
> > - * Pairs with ACQUIRE semantics in vma_start_read().
> > */
> > - smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +
> > + if (acquire) {
> > + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > + /*
> > + * For ACQUIRE semantics we should ensure no following stores are
> > + * reordered to appear before the mm->mm_lock_seq modification.
> > + */
> > + smp_wmb();
> > + } else {
> > + /*
> > + * We need RELEASE semantics here to ensure that preceding stores
> > + * into the VMA take effect before we unlock it with this store.
> > + * Pairs with ACQUIRE semantics in vma_start_read().
> > + */
> > + smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > + }
> > +}
> > +
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > +{
> > + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > + *seq = smp_load_acquire(&mm->mm_lock_seq);
> > + /* Allow speculation if mmap_lock is not write-locked */
> > + return (*seq & 1) == 0;
> > +}
> > +
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > +{
> > + /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> > + smp_rmb();
> > + return seq == READ_ONCE(mm->mm_lock_seq);
> > }
> > +
> > #else
> > -static inline void vma_end_write_all(struct mm_struct *mm) {}
> > +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> > #endif
> >
> > +/*
> > + * Drop all currently-held per-VMA locks.
> > + * This is called from the mmap_lock implementation directly before releasing
> > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > + * This should normally NOT be called manually from other places.
> > + * If you want to call this manually anyway, keep in mind that this will release
> > + * *all* VMA write locks, including ones from further up the stack.
> > + */
> > +static inline void vma_end_write_all(struct mm_struct *mm)
> > +{
> > + inc_mm_lock_seq(mm, false);
> > +}
> > +
> > static inline void mmap_init_lock(struct mm_struct *mm)
> > {
> > init_rwsem(&mm->mmap_lock);
> > + init_mm_lock_seq(mm);
> > }
> >
> > static inline void mmap_write_lock(struct mm_struct *mm)
> > {
> > __mmap_lock_trace_start_locking(mm, true);
> > down_write(&mm->mmap_lock);
> > + inc_mm_lock_seq(mm, true);
> > __mmap_lock_trace_acquire_returned(mm, true, true);
> > }
> >
> > @@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> > {
> > __mmap_lock_trace_start_locking(mm, true);
> > down_write_nested(&mm->mmap_lock, subclass);
> > + inc_mm_lock_seq(mm, true);
> > __mmap_lock_trace_acquire_returned(mm, true, true);
> > }
> >
> > @@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
> >
> > __mmap_lock_trace_start_locking(mm, true);
> > ret = down_write_killable(&mm->mmap_lock);
> > + if (!ret)
> > + inc_mm_lock_seq(mm, true);
> > __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> > return ret;
> > }
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 61070248a7d3..c86e87ed172b 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > seqcount_init(&mm->write_protect_seq);
> > mmap_init_lock(mm);
> > INIT_LIST_HEAD(&mm->mmlist);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > - mm->mm_lock_seq = 0;
> > -#endif
> > mm_pgtables_bytes_init(mm);
> > mm->map_count = 0;
> > mm->locked_vm = 0;
> >
> > base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
> > --
> > 2.46.0.662.g92d0881bb0-goog
> >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-12 22:19 ` Andrii Nakryiko
@ 2024-09-12 22:24 ` Suren Baghdasaryan
0 siblings, 0 replies; 30+ messages in thread
From: Suren Baghdasaryan @ 2024-09-12 22:24 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
brauner, jannh, andrii
On Thu, Sep 12, 2024 at 3:20 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Add helper functions to speculatively perform operations without
> > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > write-locked and mm is not modified from under us.
> >
> > Here you go. I hope I got the ordering right this time around, but I
> > would feel much better if Jann reviewed it before it's included in
> > your next patchset :)
>
> Thanks a lot! And yes, I'll give it a bit of time for reviews before
> sending a new revision.
>
> Did you by any chance get any new ideas for possible benchmarks
> (beyond what I did with will-it-scale)?
Hmm. You could use Mel Gorman's mmtests suite I guess.
>
>
> > Thanks,
> > Suren.
> >
> > >
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > Changes since v1 [1]:
> > > - Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
> > > more strict, per Jann Horn
> > >
> > > [1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
> > >
> > > include/linux/mm_types.h | 3 ++
> > > include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
> > > kernel/fork.c | 3 --
> > > 3 files changed, 65 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 6e3bdf8e38bc..5d8cdebd42bc 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -887,6 +887,9 @@ struct mm_struct {
> > > * Roughly speaking, incrementing the sequence number is
> > > * equivalent to releasing locks on VMAs; reading the sequence
> > > * number can be part of taking a read lock on a VMA.
> > > + * Incremented every time mmap_lock is write-locked/unlocked.
> > > + * Initialized to 0, therefore odd values indicate mmap_lock
> > > + * is write-locked and even values that it's released.
> > > *
> > > * Can be modified under write mmap_lock using RELEASE
> > > * semantics.
> > > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > > index de9dc20b01ba..a281519d0c12 100644
> > > --- a/include/linux/mmap_lock.h
> > > +++ b/include/linux/mmap_lock.h
> > > @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> > > }
> > >
> > > #ifdef CONFIG_PER_VMA_LOCK
> > > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > > +{
> > > + mm->mm_lock_seq = 0;
> > > +}
> > > +
> > > /*
> > > - * Drop all currently-held per-VMA locks.
> > > - * This is called from the mmap_lock implementation directly before releasing
> > > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > > - * This should normally NOT be called manually from other places.
> > > - * If you want to call this manually anyway, keep in mind that this will release
> > > - * *all* VMA write locks, including ones from further up the stack.
> > > + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> > > + * or write-unlocked (RELEASE semantics).
> > > */
> > > -static inline void vma_end_write_all(struct mm_struct *mm)
> > > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> > > {
> > > mmap_assert_write_locked(mm);
> > > /*
> > > * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> > > * mmap_lock being held.
> > > - * We need RELEASE semantics here to ensure that preceding stores into
> > > - * the VMA take effect before we unlock it with this store.
> > > - * Pairs with ACQUIRE semantics in vma_start_read().
> > > */
> > > - smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > > +
> > > + if (acquire) {
> > > + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > > + /*
> > > + * For ACQUIRE semantics we should ensure no following stores are
> > > + * reordered to appear before the mm->mm_lock_seq modification.
> > > + */
> > > + smp_wmb();
> > > + } else {
> > > + /*
> > > + * We need RELEASE semantics here to ensure that preceding stores
> > > + * into the VMA take effect before we unlock it with this store.
> > > + * Pairs with ACQUIRE semantics in vma_start_read().
> > > + */
> > > + smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > > + }
> > > +}
> > > +
> > > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > > +{
> > > + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > > + *seq = smp_load_acquire(&mm->mm_lock_seq);
> > > + /* Allow speculation if mmap_lock is not write-locked */
> > > + return (*seq & 1) == 0;
> > > +}
> > > +
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > > +{
> > > + /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> > > + smp_rmb();
> > > + return seq == READ_ONCE(mm->mm_lock_seq);
> > > }
> > > +
> > > #else
> > > -static inline void vma_end_write_all(struct mm_struct *mm) {}
> > > +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> > > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> > > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> > > #endif
> > >
> > > +/*
> > > + * Drop all currently-held per-VMA locks.
> > > + * This is called from the mmap_lock implementation directly before releasing
> > > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > > + * This should normally NOT be called manually from other places.
> > > + * If you want to call this manually anyway, keep in mind that this will release
> > > + * *all* VMA write locks, including ones from further up the stack.
> > > + */
> > > +static inline void vma_end_write_all(struct mm_struct *mm)
> > > +{
> > > + inc_mm_lock_seq(mm, false);
> > > +}
> > > +
> > > static inline void mmap_init_lock(struct mm_struct *mm)
> > > {
> > > init_rwsem(&mm->mmap_lock);
> > > + init_mm_lock_seq(mm);
> > > }
> > >
> > > static inline void mmap_write_lock(struct mm_struct *mm)
> > > {
> > > __mmap_lock_trace_start_locking(mm, true);
> > > down_write(&mm->mmap_lock);
> > > + inc_mm_lock_seq(mm, true);
> > > __mmap_lock_trace_acquire_returned(mm, true, true);
> > > }
> > >
> > > @@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> > > {
> > > __mmap_lock_trace_start_locking(mm, true);
> > > down_write_nested(&mm->mmap_lock, subclass);
> > > + inc_mm_lock_seq(mm, true);
> > > __mmap_lock_trace_acquire_returned(mm, true, true);
> > > }
> > >
> > > @@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
> > >
> > > __mmap_lock_trace_start_locking(mm, true);
> > > ret = down_write_killable(&mm->mmap_lock);
> > > + if (!ret)
> > > + inc_mm_lock_seq(mm, true);
> > > __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> > > return ret;
> > > }
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 61070248a7d3..c86e87ed172b 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > > seqcount_init(&mm->write_protect_seq);
> > > mmap_init_lock(mm);
> > > INIT_LIST_HEAD(&mm->mmlist);
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > - mm->mm_lock_seq = 0;
> > > -#endif
> > > mm_pgtables_bytes_init(mm);
> > > mm->map_count = 0;
> > > mm->locked_vm = 0;
> > >
> > > base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
> > > --
> > > 2.46.0.662.g92d0881bb0-goog
> > >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-12 21:02 ` [PATCH v2 1/1] " Suren Baghdasaryan
2024-09-12 21:04 ` Suren Baghdasaryan
@ 2024-09-12 22:52 ` Jann Horn
2024-09-24 17:15 ` Matthew Wilcox
1 sibling, 1 reply; 30+ messages in thread
From: Jann Horn @ 2024-09-12 22:52 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: linux-trace-kernel, peterz, oleg, rostedt, mhiramat, bpf,
linux-kernel, jolsa, paulmck, willy, akpm, linux-mm, mjguzik,
brauner, andrii
On Thu, Sep 12, 2024 at 11:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
I think this is okay now, except for some comments that should be
fixed up. (Plus my gripe about the sequence count being 32-bit.)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..5d8cdebd42bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -887,6 +887,9 @@ struct mm_struct {
> * Roughly speaking, incrementing the sequence number is
> * equivalent to releasing locks on VMAs; reading the sequence
> * number can be part of taking a read lock on a VMA.
> + * Incremented every time mmap_lock is write-locked/unlocked.
> + * Initialized to 0, therefore odd values indicate mmap_lock
> + * is write-locked and even values that it's released.
FWIW, I would still feel happier if this was a 64-bit number, though I
guess at least with uprobes the attack surface is not that large even
if you can wrap that counter... 2^31 counter increments are not all
that much, especially if someone introduces a kernel path in the
future that lets you repeatedly take the mmap_lock for writing within
a single syscall without doing much work, or maybe on some machine
where syscalls are really fast. I really don't like hinging memory
safety on how fast or slow some piece of code can run, unless we can
make strong arguments about it based on how many memory writes a CPU
core is capable of doing per second or stuff like that.
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index de9dc20b01ba..a281519d0c12 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> }
>
> #ifdef CONFIG_PER_VMA_LOCK
> +static inline void init_mm_lock_seq(struct mm_struct *mm)
> +{
> + mm->mm_lock_seq = 0;
> +}
> +
> /*
> - * Drop all currently-held per-VMA locks.
> - * This is called from the mmap_lock implementation directly before releasing
> - * a write-locked mmap_lock (or downgrading it to read-locked).
> - * This should normally NOT be called manually from other places.
> - * If you want to call this manually anyway, keep in mind that this will release
> - * *all* VMA write locks, including ones from further up the stack.
> + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> + * or write-unlocked (RELEASE semantics).
> */
> -static inline void vma_end_write_all(struct mm_struct *mm)
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> {
> mmap_assert_write_locked(mm);
Not a memory barriers thing, but maybe you could throw in some kind of
VM_WARN_ON() in the branches below that checks that the sequence
number is odd/even as expected, just to make extra sure...
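Rough sketch of what I mean (whether it's VM_WARN_ON_ONCE() or a plain
VM_WARN_ON() doesn't matter much):

	if (acquire) {
		/* old value should be even, i.e. not write-locked yet */
		VM_WARN_ON_ONCE(mm->mm_lock_seq & 1);
		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
		smp_wmb();
	} else {
		/* old value should be odd, i.e. still write-locked */
		VM_WARN_ON_ONCE(!(mm->mm_lock_seq & 1));
		smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
	}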
> /*
> * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> * mmap_lock being held.
> - * We need RELEASE semantics here to ensure that preceding stores into
> - * the VMA take effect before we unlock it with this store.
> - * Pairs with ACQUIRE semantics in vma_start_read().
> */
> - smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +
> + if (acquire) {
> + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> + /*
> + * For ACQUIRE semantics we should ensure no following stores are
> + * reordered to appear before the mm->mm_lock_seq modification.
> + */
> + smp_wmb();
This is not really a full ACQUIRE; smp_wmb() only orders *stores*, not
loads, while a real ACQUIRE also prevents reads from being reordered
up above the atomic access. Please reword the comment to make it clear
that we don't have a full ACQUIRE here.
We can still have subsequent loads reordered up before the
mm->mm_lock_seq increment. But I guess that's probably fine as long as
nobody does anything exceedingly weird that involves lockless users
*writing* data that we have to read consistently, which wouldn't
really make sense...
So yeah, I guess this is probably fine, and it matches what
do_raw_write_seqcount_begin() is doing.
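Maybe something along these lines for the comment (wording is just a
suggestion):

	/*
	 * Order the mm->mm_lock_seq store before any following stores.
	 * Note this is only a write barrier, not a full ACQUIRE: later
	 * loads can still be reordered before the counter update, same
	 * as with do_raw_write_seqcount_begin().
	 */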
> + } else {
> + /*
> + * We need RELEASE semantics here to ensure that preceding stores
> + * into the VMA take effect before we unlock it with this store.
> + * Pairs with ACQUIRE semantics in vma_start_read().
> + */
> + smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> + }
> +}
> +
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> +{
> + /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> + *seq = smp_load_acquire(&mm->mm_lock_seq);
> + /* Allow speculation if mmap_lock is not write-locked */
> + return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> + /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
(see above, it's not actually a full ACQUIRE)
> + smp_rmb();
> + return seq == READ_ONCE(mm->mm_lock_seq);
> }
> +
> #else
> -static inline void vma_end_write_all(struct mm_struct *mm) {}
> +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> #endif
>
> +/*
> + * Drop all currently-held per-VMA locks.
> + * This is called from the mmap_lock implementation directly before releasing
> + * a write-locked mmap_lock (or downgrading it to read-locked).
> + * This should normally NOT be called manually from other places.
> + * If you want to call this manually anyway, keep in mind that this will release
> + * *all* VMA write locks, including ones from further up the stack.
Outdated comment - now you are absolutely not allowed to call
vma_end_write_all() manually anymore, it would mess up the odd/even
state of the counter.
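Maybe something like (again, just a suggested wording):

	/*
	 * Drop all currently-held per-VMA locks.
	 * This is called from the mmap_lock implementation directly before
	 * releasing a write-locked mmap_lock (or downgrading it to
	 * read-locked).
	 * This must NOT be called from anywhere else; an extra call would
	 * leave mm_lock_seq with the wrong odd/even parity.
	 */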
> + */
> +static inline void vma_end_write_all(struct mm_struct *mm)
> +{
> + inc_mm_lock_seq(mm, false);
> +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-06 5:12 ` [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
2024-09-08 1:22 ` Liam R. Howlett
2024-09-09 13:12 ` Jann Horn
@ 2024-09-15 15:04 ` Oleg Nesterov
2024-09-17 8:19 ` Andrii Nakryiko
2 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2024-09-15 15:04 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, rostedt, mhiramat, bpf, linux-kernel,
jolsa, paulmck, willy, surenb, akpm, linux-mm, mjguzik, brauner,
jannh
On 09/05, Andrii Nakryiko wrote:
>
> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
...
> + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> + goto bail;
Not that this can really simplify your patch, feel free to ignore, but I don't
think you need to check vma->vm_flags.
Yes, find_active_uprobe_rcu() does the same valid_vma(vma, false) check, but it
too can/should be removed, afaics.
valid_vma(vma, false) makes sense in, say, unapply_uprobe() to quickly filter
out vma's which can't have this bp installed, but not in the handle_swbp() paths.
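IOW (just a sketch of what I mean), in the speculative path the check could
shrink down to

	if (!vm_file)
		goto bail;

and the vm_flags part can simply go away.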
Oleg.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
2024-09-15 15:04 ` Oleg Nesterov
@ 2024-09-17 8:19 ` Andrii Nakryiko
0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2024-09-17 8:19 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, rostedt, mhiramat,
bpf, linux-kernel, jolsa, paulmck, willy, surenb, akpm, linux-mm,
mjguzik, brauner, jannh
On Sun, Sep 15, 2024 at 5:04 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/05, Andrii Nakryiko wrote:
> >
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > + const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> ...
> > + if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > + goto bail;
>
> Not that this can really simplify your patch, feel free to ignore, but I don't
> think you need to check vma->vm_flags.
>
> Yes, find_active_uprobe_rcu() does the same valid_vma(vma, false) check, but it
> too can/should be removed, afaics.
yep, agreed, I'll try to simplify both, your points make total sense
>
> valid_vma(vma, false) makes sense in, say, unapply_uprobe() to quickly filter
> out vma's which can't have this bp installed, but not in the handle_swbp() paths.
>
> Oleg.
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-12 22:52 ` Jann Horn
@ 2024-09-24 17:15 ` Matthew Wilcox
2024-09-24 18:00 ` Jann Horn
0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2024-09-24 17:15 UTC (permalink / raw)
To: Jann Horn
Cc: Suren Baghdasaryan, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck, akpm, linux-mm,
mjguzik, brauner, andrii
On Fri, Sep 13, 2024 at 12:52:39AM +0200, Jann Horn wrote:
> FWIW, I would still feel happier if this was a 64-bit number, though I
> guess at least with uprobes the attack surface is not that large even
> if you can wrap that counter... 2^31 counter increments are not all
> that much, especially if someone introduces a kernel path in the
> future that lets you repeatedly take the mmap_lock for writing within
> a single syscall without doing much work, or maybe on some machine
> where syscalls are really fast. I really don't like hinging memory
> safety on how fast or slow some piece of code can run, unless we can
> make strong arguments about it based on how many memory writes a CPU
> core is capable of doing per second or stuff like that.
You could repeatedly call munmap(1, 0), which will take the
mmap_write_lock(), do no work, and call mmap_write_unlock(). We could
fix that by moving the start/len validation outside the
mmap_write_lock(), but it won't increase the path length by much.
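Roughly like this (just a sketch, not the exact validation the kernel does
today), hoisting the cheap checks in front of the lock in __vm_munmap():

	if (offset_in_page(start) || !PAGE_ALIGN(len))
		return -EINVAL;
	if (mmap_write_lock_killable(mm))
		return -EINTR;
	/* ... rest of the munmap path unchanged ... */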
How many syscalls can we do per second?
https://blogs.oracle.com/linux/post/syscall-latency suggests 217ns per
syscall, so we'll be close to 4.6m syscalls/second, or about 466 seconds
(7 minutes, 46 seconds) for 2^31 of them.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start|end}
2024-09-24 17:15 ` Matthew Wilcox
@ 2024-09-24 18:00 ` Jann Horn
0 siblings, 0 replies; 30+ messages in thread
From: Jann Horn @ 2024-09-24 18:00 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Suren Baghdasaryan, linux-trace-kernel, peterz, oleg, rostedt,
mhiramat, bpf, linux-kernel, jolsa, paulmck, akpm, linux-mm,
mjguzik, brauner, andrii
On Tue, Sep 24, 2024 at 7:15 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Sep 13, 2024 at 12:52:39AM +0200, Jann Horn wrote:
> > FWIW, I would still feel happier if this was a 64-bit number, though I
> > guess at least with uprobes the attack surface is not that large even
> > if you can wrap that counter... 2^31 counter increments are not all
> > that much, especially if someone introduces a kernel path in the
> > future that lets you repeatedly take the mmap_lock for writing within
> > a single syscall without doing much work, or maybe on some machine
> > where syscalls are really fast. I really don't like hinging memory
> > safety on how fast or slow some piece of code can run, unless we can
> > make strong arguments about it based on how many memory writes a CPU
> > core is capable of doing per second or stuff like that.
>
> You could repeatedly call munmap(1, 0) which will take the
> mmap_write_lock, do no work and call mmap_write_unlock(). We could
> fix that by moving the start/len validation outside the
> mmap_write_lock(), but it won't increase the path length by much.
> How many syscalls can we do per second?
> https://blogs.oracle.com/linux/post/syscall-latency suggests 217ns per
> syscall, so we'll be close to 4.6m syscalls/second or 466 seconds (7
> minutes, 46 seconds).
Yeah, that seems like a pretty reasonable guess.
One method that may or may not be faster would be to use an io-uring
worker to dispatch a bunch of IORING_OP_MADVISE operations - that
would save on syscall entry overhead but in exchange you'd have to
worry about feeding a constant stream of work into the worker thread
in a cache-efficient way, maybe by having one CPU constantly switch
back and forth between a userspace thread and a uring worker or
something like that.
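For illustration, the submission side could look roughly like this with
liburing (untested sketch, error handling omitted; the mapping and batch
size are arbitrary, and it assumes MADV_NORMAL gets dispatched to an io-wq
worker and ends up write-locking mmap_lock):

	#include <liburing.h>
	#include <sys/mman.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_cqe *cqe;
		void *addr;
		int i;

		io_uring_queue_init(64, &ring, 0);
		addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (;;) {
			for (i = 0; i < 64; i++) {
				struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

				/* MADV_NORMAL is assumed to cycle mmap_lock for writing */
				io_uring_prep_madvise(sqe, addr, 4096, MADV_NORMAL);
			}
			io_uring_submit(&ring);
			for (i = 0; i < 64; i++) {
				io_uring_wait_cqe(&ring, &cqe);
				io_uring_cqe_seen(&ring, cqe);
			}
		}
	}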
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2024-09-24 18:01 UTC | newest]
Thread overview: 30+ messages
2024-09-06 5:12 [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Andrii Nakryiko
2024-09-06 5:12 ` [PATCH 1/2] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
2024-09-09 12:35 ` Jann Horn
2024-09-10 2:09 ` Suren Baghdasaryan
2024-09-10 15:31 ` Jann Horn
2024-09-11 21:34 ` Andrii Nakryiko
2024-09-11 21:48 ` Suren Baghdasaryan
2024-09-12 21:02 ` [PATCH v2 1/1] " Suren Baghdasaryan
2024-09-12 21:04 ` Suren Baghdasaryan
2024-09-12 22:19 ` Andrii Nakryiko
2024-09-12 22:24 ` Suren Baghdasaryan
2024-09-12 22:52 ` Jann Horn
2024-09-24 17:15 ` Matthew Wilcox
2024-09-24 18:00 ` Jann Horn
2024-09-06 5:12 ` [PATCH 2/2] uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution Andrii Nakryiko
2024-09-08 1:22 ` Liam R. Howlett
2024-09-09 1:08 ` Andrii Nakryiko
2024-09-09 13:12 ` Jann Horn
2024-09-09 21:29 ` Andrii Nakryiko
2024-09-10 15:39 ` Jann Horn
2024-09-10 20:56 ` Andrii Nakryiko
2024-09-10 16:32 ` Suren Baghdasaryan
2024-09-10 20:58 ` Andrii Nakryiko
2024-09-12 11:17 ` Christian Brauner
2024-09-12 17:54 ` Andrii Nakryiko
2024-09-15 15:04 ` Oleg Nesterov
2024-09-17 8:19 ` Andrii Nakryiko
2024-09-10 16:06 ` [PATCH 0/2] uprobes,mm: speculative lockless VMA-to-uprobe lookup Jann Horn
2024-09-10 17:58 ` Andrii Nakryiko
2024-09-10 18:13 ` Jann Horn