Re: [RFC 1/1] mm: introduce mmap_lock_speculation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
       [not found] <20240807182325.2585582-1-surenb@google.com>
@ 2024-08-08 20:18 ` Andrii Nakryiko
  2024-08-08 21:02   ` Suren Baghdasaryan
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-08-08 20:18 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: akpm, peterz, linux-kernel, linux-mm

On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> ---

This change makes sense and makes mm's seq a bit more useful and
meaningful. I've also tested it locally with uprobe stress-test, and
it seems to work great, I haven't run into any problems with a
multi-hour stress test run so far. Thanks!

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> Discussion [1] follow-up. If proves to be useful can be included in that
> patchset. Based on mm-unstable.
>
> [1] https://lore.kernel.org/all/20240730134605.GO33588@noisy.programming.kicks-ass.net/
>
>  include/linux/mm_types.h  |  3 +++
>  include/linux/mmap_lock.h | 53 +++++++++++++++++++++++++++++++--------
>  kernel/fork.c             |  3 ---
>  3 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 003619fab20e..a426e6ced604 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -887,6 +887,9 @@ struct mm_struct {
>                  * Roughly speaking, incrementing the sequence number is
>                  * equivalent to releasing locks on VMAs; reading the sequence
>                  * number can be part of taking a read lock on a VMA.
> +                * Incremented every time mmap_lock is write-locked/unlocked.
> +                * Initialized to 0, therefore odd values indicate mmap_lock
> +                * is write-locked and even values that it's released.
>                  *
>                  * Can be modified under write mmap_lock using RELEASE
>                  * semantics.
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index de9dc20b01ba..5410ce741d75 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,15 +71,12 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
>  }
>
>  #ifdef CONFIG_PER_VMA_LOCK
> -/*
> - * Drop all currently-held per-VMA locks.
> - * This is called from the mmap_lock implementation directly before releasing
> - * a write-locked mmap_lock (or downgrading it to read-locked).
> - * This should normally NOT be called manually from other places.
> - * If you want to call this manually anyway, keep in mind that this will release
> - * *all* VMA write locks, including ones from further up the stack.
> - */
> -static inline void vma_end_write_all(struct mm_struct *mm)
> +static inline void init_mm_lock_seq(struct mm_struct *mm)
> +{
> +       mm->mm_lock_seq = 0;
> +}
> +
> +static inline void inc_mm_lock_seq(struct mm_struct *mm)
>  {
>         mmap_assert_write_locked(mm);
>         /*
> @@ -91,19 +88,52 @@ static inline void vma_end_write_all(struct mm_struct *mm)
>          */
>         smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
>  }
> +
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> +{
> +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> +       /* Allow speculation if mmap_lock is not write-locked */
> +       return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> +       return seq == smp_load_acquire(&mm->mm_lock_seq);
> +}
> +
>  #else
> -static inline void vma_end_write_all(struct mm_struct *mm) {}
> +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> +static inline void inc_mm_lock_seq(struct mm_struct *mm) {}
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
>  #endif
>
> +/*
> + * Drop all currently-held per-VMA locks.
> + * This is called from the mmap_lock implementation directly before releasing
> + * a write-locked mmap_lock (or downgrading it to read-locked).
> + * This should normally NOT be called manually from other places.
> + * If you want to call this manually anyway, keep in mind that this will release
> + * *all* VMA write locks, including ones from further up the stack.
> + */
> +static inline void vma_end_write_all(struct mm_struct *mm)
> +{
> +       inc_mm_lock_seq(mm);
> +}
> +
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {
>         init_rwsem(&mm->mmap_lock);
> +       init_mm_lock_seq(mm);
>  }
>
>  static inline void mmap_write_lock(struct mm_struct *mm)
>  {
>         __mmap_lock_trace_start_locking(mm, true);
>         down_write(&mm->mmap_lock);
> +       inc_mm_lock_seq(mm);
>         __mmap_lock_trace_acquire_returned(mm, true, true);
>  }
>
> @@ -111,6 +141,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
>  {
>         __mmap_lock_trace_start_locking(mm, true);
>         down_write_nested(&mm->mmap_lock, subclass);
> +       inc_mm_lock_seq(mm);
>         __mmap_lock_trace_acquire_returned(mm, true, true);
>  }
>
> @@ -120,6 +151,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
>
>         __mmap_lock_trace_start_locking(mm, true);
>         ret = down_write_killable(&mm->mmap_lock);
> +       if (!ret)
> +               inc_mm_lock_seq(mm);
>         __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
>         return ret;
>  }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3d590e51ce84..73e37af8a24d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>         seqcount_init(&mm->write_protect_seq);
>         mmap_init_lock(mm);
>         INIT_LIST_HEAD(&mm->mmlist);
> -#ifdef CONFIG_PER_VMA_LOCK
> -       mm->mm_lock_seq = 0;
> -#endif
>         mm_pgtables_bytes_init(mm);
>         mm->map_count = 0;
>         mm->locked_vm = 0;
>
> base-commit: 98808d08fc0f78ee638e0c0a88020fbbaf581ec6
> --
> 2.46.0.rc2.264.g509ed76dc8-goog
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 20:18 ` [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
@ 2024-08-08 21:02   ` Suren Baghdasaryan
  2024-08-08 21:11     ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Suren Baghdasaryan @ 2024-08-08 21:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: akpm, peterz, linux-kernel, linux-mm, Jann Horn, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Add helper functions to speculatively perform operations without
> > read-locking mmap_lock, expecting that mmap_lock will not be
> > write-locked and mm is not modified from under us.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > ---
>
> This change makes sense and makes mm's seq a bit more useful and
> meaningful. I've also tested it locally with uprobe stress-test, and
> it seems to work great, I haven't run into any problems with a
> multi-hour stress test run so far. Thanks!

Thanks for testing and feel free to include this patch into your set.

I've been thinking about this some more and there is a very unlikely
corner case if between mmap_lock_speculation_start() and
mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
times that mm->mm_lock_seq (int) overflows and just happen to reach
the same value as we recorded in mmap_lock_speculation_start(). This
would generate a false positive, which would show up as if the
mmap_lock was never touched. Such overflows are possible for vm_lock
as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
but they are not critical because a false result would simply lead to
a retry under mmap_lock. However for your case this would be a
critical issue. This is an extremely low probability scenario but
should we still try to handle it?

I'm CC'ing several mm folks and Jann Horn to chime in.

>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> > Discussion [1] follow-up. If proves to be useful can be included in that
> > patchset. Based on mm-unstable.
> >
> > [1] https://lore.kernel.org/all/20240730134605.GO33588@noisy.programming.kicks-ass.net/
> >
> >  include/linux/mm_types.h  |  3 +++
> >  include/linux/mmap_lock.h | 53 +++++++++++++++++++++++++++++++--------
> >  kernel/fork.c             |  3 ---
> >  3 files changed, 46 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 003619fab20e..a426e6ced604 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -887,6 +887,9 @@ struct mm_struct {
> >                  * Roughly speaking, incrementing the sequence number is
> >                  * equivalent to releasing locks on VMAs; reading the sequence
> >                  * number can be part of taking a read lock on a VMA.
> > +                * Incremented every time mmap_lock is write-locked/unlocked.
> > +                * Initialized to 0, therefore odd values indicate mmap_lock
> > +                * is write-locked and even values that it's released.
> >                  *
> >                  * Can be modified under write mmap_lock using RELEASE
> >                  * semantics.
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index de9dc20b01ba..5410ce741d75 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -71,15 +71,12 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> >  }
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > -/*
> > - * Drop all currently-held per-VMA locks.
> > - * This is called from the mmap_lock implementation directly before releasing
> > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > - * This should normally NOT be called manually from other places.
> > - * If you want to call this manually anyway, keep in mind that this will release
> > - * *all* VMA write locks, including ones from further up the stack.
> > - */
> > -static inline void vma_end_write_all(struct mm_struct *mm)
> > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > +{
> > +       mm->mm_lock_seq = 0;
> > +}
> > +
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm)
> >  {
> >         mmap_assert_write_locked(mm);
> >         /*
> > @@ -91,19 +88,52 @@ static inline void vma_end_write_all(struct mm_struct *mm)
> >          */
> >         smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> >  }
> > +
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > +{
> > +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> > +       /* Allow speculation if mmap_lock is not write-locked */
> > +       return (*seq & 1) == 0;
> > +}
> > +
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > +{
> > +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > +       return seq == smp_load_acquire(&mm->mm_lock_seq);
> > +}
> > +
> >  #else
> > -static inline void vma_end_write_all(struct mm_struct *mm) {}
> > +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm) {}
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> >  #endif
> >
> > +/*
> > + * Drop all currently-held per-VMA locks.
> > + * This is called from the mmap_lock implementation directly before releasing
> > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > + * This should normally NOT be called manually from other places.
> > + * If you want to call this manually anyway, keep in mind that this will release
> > + * *all* VMA write locks, including ones from further up the stack.
> > + */
> > +static inline void vma_end_write_all(struct mm_struct *mm)
> > +{
> > +       inc_mm_lock_seq(mm);
> > +}
> > +
> >  static inline void mmap_init_lock(struct mm_struct *mm)
> >  {
> >         init_rwsem(&mm->mmap_lock);
> > +       init_mm_lock_seq(mm);
> >  }
> >
> >  static inline void mmap_write_lock(struct mm_struct *mm)
> >  {
> >         __mmap_lock_trace_start_locking(mm, true);
> >         down_write(&mm->mmap_lock);
> > +       inc_mm_lock_seq(mm);
> >         __mmap_lock_trace_acquire_returned(mm, true, true);
> >  }
> >
> > @@ -111,6 +141,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> >  {
> >         __mmap_lock_trace_start_locking(mm, true);
> >         down_write_nested(&mm->mmap_lock, subclass);
> > +       inc_mm_lock_seq(mm);
> >         __mmap_lock_trace_acquire_returned(mm, true, true);
> >  }
> >
> > @@ -120,6 +151,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
> >
> >         __mmap_lock_trace_start_locking(mm, true);
> >         ret = down_write_killable(&mm->mmap_lock);
> > +       if (!ret)
> > +               inc_mm_lock_seq(mm);
> >         __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> >         return ret;
> >  }
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3d590e51ce84..73e37af8a24d 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >         seqcount_init(&mm->write_protect_seq);
> >         mmap_init_lock(mm);
> >         INIT_LIST_HEAD(&mm->mmlist);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -       mm->mm_lock_seq = 0;
> > -#endif
> >         mm_pgtables_bytes_init(mm);
> >         mm->map_count = 0;
> >         mm->locked_vm = 0;
> >
> > base-commit: 98808d08fc0f78ee638e0c0a88020fbbaf581ec6
> > --
> > 2.46.0.rc2.264.g509ed76dc8-goog
> >


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 21:02   ` Suren Baghdasaryan
@ 2024-08-08 21:11     ` Andrii Nakryiko
  2024-08-08 21:42       ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-08-08 21:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, peterz, linux-kernel, linux-mm, Jann Horn, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Add helper functions to speculatively perform operations without
> > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > write-locked and mm is not modified from under us.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > ---
> >
> > This change makes sense and makes mm's seq a bit more useful and
> > meaningful. I've also tested it locally with uprobe stress-test, and
> > it seems to work great, I haven't run into any problems with a
> > multi-hour stress test run so far. Thanks!
>
> Thanks for testing and feel free to include this patch into your set.

Will do!

>
> I've been thinking about this some more and there is a very unlikely
> corner case if between mmap_lock_speculation_start() and
> mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> times that mm->mm_lock_seq (int) overflows and just happen to reach
> the same value as we recorded in mmap_lock_speculation_start(). This
> would generate a false positive, which would show up as if the
> mmap_lock was never touched. Such overflows are possible for vm_lock
> as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> but they are not critical because a false result would simply lead to
> a retry under mmap_lock. However for your case this would be a
> critical issue. This is an extremely low probability scenario but
> should we still try to handle it?
>

No, I think it's fine. Similar problems could happen with refcount_t,
for instance (it has a logic to have a sticky "has overflown" state,
which I believe relies on the fact that we'll never be able to
increment refcount 2bln+ times in between some resetting logic).
Anyways, I think it's utterly unrealistic and should be considered
impossible.



> I'm CC'ing several mm folks and Jann Horn to chime in.
>
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > > Discussion [1] follow-up. If proves to be useful can be included in that
> > > patchset. Based on mm-unstable.
> > >
> > > [1] https://lore.kernel.org/all/20240730134605.GO33588@noisy.programming.kicks-ass.net/
> > >
> > >  include/linux/mm_types.h  |  3 +++
> > >  include/linux/mmap_lock.h | 53 +++++++++++++++++++++++++++++++--------
> > >  kernel/fork.c             |  3 ---
> > >  3 files changed, 46 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 003619fab20e..a426e6ced604 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -887,6 +887,9 @@ struct mm_struct {
> > >                  * Roughly speaking, incrementing the sequence number is
> > >                  * equivalent to releasing locks on VMAs; reading the sequence
> > >                  * number can be part of taking a read lock on a VMA.
> > > +                * Incremented every time mmap_lock is write-locked/unlocked.
> > > +                * Initialized to 0, therefore odd values indicate mmap_lock
> > > +                * is write-locked and even values that it's released.
> > >                  *
> > >                  * Can be modified under write mmap_lock using RELEASE
> > >                  * semantics.
> > > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > > index de9dc20b01ba..5410ce741d75 100644
> > > --- a/include/linux/mmap_lock.h
> > > +++ b/include/linux/mmap_lock.h
> > > @@ -71,15 +71,12 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> > >  }
> > >
> > >  #ifdef CONFIG_PER_VMA_LOCK
> > > -/*
> > > - * Drop all currently-held per-VMA locks.
> > > - * This is called from the mmap_lock implementation directly before releasing
> > > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > > - * This should normally NOT be called manually from other places.
> > > - * If you want to call this manually anyway, keep in mind that this will release
> > > - * *all* VMA write locks, including ones from further up the stack.
> > > - */
> > > -static inline void vma_end_write_all(struct mm_struct *mm)
> > > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > > +{
> > > +       mm->mm_lock_seq = 0;
> > > +}
> > > +
> > > +static inline void inc_mm_lock_seq(struct mm_struct *mm)
> > >  {
> > >         mmap_assert_write_locked(mm);
> > >         /*
> > > @@ -91,19 +88,52 @@ static inline void vma_end_write_all(struct mm_struct *mm)
> > >          */
> > >         smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > >  }
> > > +
> > > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > > +{
> > > +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > > +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> > > +       /* Allow speculation if mmap_lock is not write-locked */
> > > +       return (*seq & 1) == 0;
> > > +}
> > > +
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > > +{
> > > +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > > +       return seq == smp_load_acquire(&mm->mm_lock_seq);
> > > +}
> > > +
> > >  #else
> > > -static inline void vma_end_write_all(struct mm_struct *mm) {}
> > > +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> > > +static inline void inc_mm_lock_seq(struct mm_struct *mm) {}
> > > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> > >  #endif
> > >
> > > +/*
> > > + * Drop all currently-held per-VMA locks.
> > > + * This is called from the mmap_lock implementation directly before releasing
> > > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > > + * This should normally NOT be called manually from other places.
> > > + * If you want to call this manually anyway, keep in mind that this will release
> > > + * *all* VMA write locks, including ones from further up the stack.
> > > + */
> > > +static inline void vma_end_write_all(struct mm_struct *mm)
> > > +{
> > > +       inc_mm_lock_seq(mm);
> > > +}
> > > +
> > >  static inline void mmap_init_lock(struct mm_struct *mm)
> > >  {
> > >         init_rwsem(&mm->mmap_lock);
> > > +       init_mm_lock_seq(mm);
> > >  }
> > >
> > >  static inline void mmap_write_lock(struct mm_struct *mm)
> > >  {
> > >         __mmap_lock_trace_start_locking(mm, true);
> > >         down_write(&mm->mmap_lock);
> > > +       inc_mm_lock_seq(mm);
> > >         __mmap_lock_trace_acquire_returned(mm, true, true);
> > >  }
> > >
> > > @@ -111,6 +141,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> > >  {
> > >         __mmap_lock_trace_start_locking(mm, true);
> > >         down_write_nested(&mm->mmap_lock, subclass);
> > > +       inc_mm_lock_seq(mm);
> > >         __mmap_lock_trace_acquire_returned(mm, true, true);
> > >  }
> > >
> > > @@ -120,6 +151,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
> > >
> > >         __mmap_lock_trace_start_locking(mm, true);
> > >         ret = down_write_killable(&mm->mmap_lock);
> > > +       if (!ret)
> > > +               inc_mm_lock_seq(mm);
> > >         __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> > >         return ret;
> > >  }
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 3d590e51ce84..73e37af8a24d 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > >         seqcount_init(&mm->write_protect_seq);
> > >         mmap_init_lock(mm);
> > >         INIT_LIST_HEAD(&mm->mmlist);
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -       mm->mm_lock_seq = 0;
> > > -#endif
> > >         mm_pgtables_bytes_init(mm);
> > >         mm->map_count = 0;
> > >         mm->locked_vm = 0;
> > >
> > > base-commit: 98808d08fc0f78ee638e0c0a88020fbbaf581ec6
> > > --
> > > 2.46.0.rc2.264.g509ed76dc8-goog
> > >


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 21:11     ` Andrii Nakryiko
@ 2024-08-08 21:42       ` Jann Horn
  2024-08-08 22:05         ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2024-08-08 21:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Suren Baghdasaryan, akpm, peterz, linux-kernel, linux-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko

On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > Add helper functions to speculatively perform operations without
> > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > write-locked and mm is not modified from under us.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > ---
> > >
> > > This change makes sense and makes mm's seq a bit more useful and
> > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > it seems to work great, I haven't run into any problems with a
> > > multi-hour stress test run so far. Thanks!
> >
> > Thanks for testing and feel free to include this patch into your set.
>
> Will do!
>
> >
> > I've been thinking about this some more and there is a very unlikely
> > corner case if between mmap_lock_speculation_start() and
> > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > the same value as we recorded in mmap_lock_speculation_start(). This
> > would generate a false positive, which would show up as if the
> > mmap_lock was never touched. Such overflows are possible for vm_lock
> > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > but they are not critical because a false result would simply lead to
> > a retry under mmap_lock. However for your case this would be a
> > critical issue. This is an extremely low probability scenario but
> > should we still try to handle it?
> >
>
> No, I think it's fine.

Modern computers don't take *that* long to count to 2^32, even when
every step involves one or more syscalls. I've seen bugs where, for
example, a 32-bit refcount is not decremented where it should, making
it possible to overflow the refcount with 2^32 operations of some
kind, and those have taken something like 3 hours to trigger in one
case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
14 hours in another case. Or even cases where, if you have enough RAM,
you can create 2^32 legitimate references to an object and overflow a
refcount that way
(https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
had more than 32 GiB of RAM, taking only 25 minutes to overflow the
32-bit counter - and that is with every step allocating memory).
So I'd expect 2^32 simple operations that take the mmap lock for
writing to be faster than 25 minutes on a modern desktop machine.

So for a reader of some kinda 32-bit sequence count, if it is
conceivably possible for the reader to take at least maybe a couple
minutes or so between the sequence count reads (also counting time
during which the reader is preempted or something like that), there
could be a problem. At that point in the analysis, if you wanted to
know whether it's actually exploitable, I guess you'd have to look at
what kinda context you're running in, and what kinda events can
interrupt/preempt you (like whether someone can send a sufficiently
dense flood of IPIs to completely prevent you making forward progress,
like in https://www.vusec.net/projects/ghostrace/), and for how long
those things can delay you (maybe including what the pessimal
scheduler behavior looks like if you're in preemptible context, or how
long clock interrupts can take to execute when processing a giant pile
of epoll watches), and so on...

> Similar problems could happen with refcount_t,
> for instance (it has a logic to have a sticky "has overflown" state,
> which I believe relies on the fact that we'll never be able to
> increment refcount 2bln+ times in between some resetting logic).
> Anyways, I think it's utterly unrealistic and should be considered
> impossible.

IIRC refcount_t protects against this even in theoretical, fairly
pessimal scenarios, because the maximum number of tasks you can have
on Linux is smaller than the number of refcount decrements you'd have
to do in parallel to bring a pinned refcount back down to 0.

I know this is a weakness of seqcount_t (though last time I checked I
couldn't find any examples where it seemed like you could actually
abuse this).

But if you want a counter, and something bad would happen if the
counter wraps, and you don't have a really strong guarantee that the
counter won't wrap, I think it's more robust to make it 64-bit. (Or an
unsigned long and hope there aren't too many people who still run
32-bit kernels on anything important... though that's not very
pretty.)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 21:42       ` Jann Horn
@ 2024-08-08 22:05         ` Andrii Nakryiko
  2024-08-08 22:16           ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-08-08 22:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, akpm, peterz, linux-kernel, linux-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko

On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > Add helper functions to speculatively perform operations without
> > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > write-locked and mm is not modified from under us.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > ---
> > > >
> > > > This change makes sense and makes mm's seq a bit more useful and
> > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > it seems to work great, I haven't run into any problems with a
> > > > multi-hour stress test run so far. Thanks!
> > >
> > > Thanks for testing and feel free to include this patch into your set.
> >
> > Will do!
> >
> > >
> > > I've been thinking about this some more and there is a very unlikely
> > > corner case if between mmap_lock_speculation_start() and
> > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > would generate a false positive, which would show up as if the
> > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > but they are not critical because a false result would simply lead to
> > > a retry under mmap_lock. However for your case this would be a
> > > critical issue. This is an extremely low probability scenario but
> > > should we still try to handle it?
> > >
> >
> > No, I think it's fine.
>
> Modern computers don't take *that* long to count to 2^32, even when
> every step involves one or more syscalls. I've seen bugs where, for
> example, a 32-bit refcount is not decremented where it should, making
> it possible to overflow the refcount with 2^32 operations of some
> kind, and those have taken something like 3 hours to trigger in one
> case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> 14 hours in another case. Or even cases where, if you have enough RAM,
> you can create 2^32 legitimate references to an object and overflow a
> refcount that way
> (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> 32-bit counter - and that is with every step allocating memory).
> So I'd expect 2^32 simple operations that take the mmap lock for
> writing to be faster than 25 minutes on a modern desktop machine.
>
> So for a reader of some kinda 32-bit sequence count, if it is
> conceivably possible for the reader to take at least maybe a couple
> minutes or so between the sequence count reads (also counting time
> during which the reader is preempted or something like that), there
> could be a problem. At that point in the analysis, if you wanted to
> know whether it's actually exploitable, I guess you'd have to look at
> what kinda context you're running in, and what kinda events can
> interrupt/preempt you (like whether someone can send a sufficiently
> dense flood of IPIs to completely prevent you making forward progress,
> like in https://www.vusec.net/projects/ghostrace/), and for how long
> those things can delay you (maybe including what the pessimal
> scheduler behavior looks like if you're in preemptible context, or how
> long clock interrupts can take to execute when processing a giant pile
> of epoll watches), and so on...
>

And here we are talking about *lockless* *speculative* VMA usage that
will last what, at most on the order of a few microseconds? So I stand
by "can never happen", because if it does, your system is so
overloaded that something like this uprobe issue is your least
concern.

> > Similar problems could happen with refcount_t,
> > for instance (it has a logic to have a sticky "has overflown" state,
> > which I believe relies on the fact that we'll never be able to
> > increment refcount 2bln+ times in between some resetting logic).
> > Anyways, I think it's utterly unrealistic and should be considered
> > impossible.
>
> IIRC refcount_t protects against this even in theoretical, fairly
> pessimal scenarios, because the maximum number of tasks you can have
> on Linux is smaller than the number of refcount decrements you'd have
> to do in parallel to bring a pinned refcount back down to 0.
>
> I know this is a weakness of seqcount_t (though last time I checked I
> couldn't find any examples where it seemed like you could actually
> abuse this).
>
> But if you want a counter, and something bad would happen if the
> counter wraps, and you don't have a really strong guarantee that the
> counter won't wrap, I think it's more robust to make it 64-bit. (Or an
> unsigned long and hope there aren't too many people who still run
> 32-bit kernels on anything important... though that's not very
> pretty.)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 22:05         ` Andrii Nakryiko
@ 2024-08-08 22:16           ` Jann Horn
  2024-08-08 22:36             ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2024-08-08 22:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Suren Baghdasaryan, akpm, peterz, linux-kernel, linux-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko

On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > Add helper functions to speculatively perform operations without
> > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > write-locked and mm is not modified from under us.
> > > > > >
> > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > ---
> > > > >
> > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > it seems to work great, I haven't run into any problems with a
> > > > > multi-hour stress test run so far. Thanks!
> > > >
> > > > Thanks for testing and feel free to include this patch into your set.
> > >
> > > Will do!
> > >
> > > >
> > > > I've been thinking about this some more and there is a very unlikely
> > > > corner case if between mmap_lock_speculation_start() and
> > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > would generate a false positive, which would show up as if the
> > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > but they are not critical because a false result would simply lead to
> > > > a retry under mmap_lock. However for your case this would be a
> > > > critical issue. This is an extremely low probability scenario but
> > > > should we still try to handle it?
> > > >
> > >
> > > No, I think it's fine.
> >
> > Modern computers don't take *that* long to count to 2^32, even when
> > every step involves one or more syscalls. I've seen bugs where, for
> > example, a 32-bit refcount is not decremented where it should, making
> > it possible to overflow the refcount with 2^32 operations of some
> > kind, and those have taken something like 3 hours to trigger in one
> > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > 14 hours in another case. Or even cases where, if you have enough RAM,
> > you can create 2^32 legitimate references to an object and overflow a
> > refcount that way
> > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > 32-bit counter - and that is with every step allocating memory).
> > So I'd expect 2^32 simple operations that take the mmap lock for
> > writing to be faster than 25 minutes on a modern desktop machine.
> >
> > So for a reader of some kinda 32-bit sequence count, if it is
> > conceivably possible for the reader to take at least maybe a couple
> > minutes or so between the sequence count reads (also counting time
> > during which the reader is preempted or something like that), there
> > could be a problem. At that point in the analysis, if you wanted to
> > know whether it's actually exploitable, I guess you'd have to look at
> > what kinda context you're running in, and what kinda events can
> > interrupt/preempt you (like whether someone can send a sufficiently
> > dense flood of IPIs to completely prevent you making forward progress,
> > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > those things can delay you (maybe including what the pessimal
> > scheduler behavior looks like if you're in preemptible context, or how
> > long clock interrupts can take to execute when processing a giant pile
> > of epoll watches), and so on...
> >
>
> And here we are talking about *lockless* *speculative* VMA usage that
> will last what, at most on the order of a few microseconds?

Are you talking about time spent in task context, or time spent while
the task is on the CPU (including time in interrupt context), or about
wall clock time?

https://www.vusec.net/projects/ghostrace/ is pretty amazing - when you
look at the paper
https://download.vusec.net/papers/ghostrace_sec24.pdf you can see in
Figure 4 how they managed to turn a race window that's 8 instructions
wide into a window they can stretch "indefinitely", and they didn't
even have to reschedule to pull it off. If I understand correctly,
they stretched the race window to something like 35 seconds and could
have stretched it even wider if they had wanted to?

(And yes, Linux fixed the specific trick they used for doing that, but
it still shows that this kinda thing is possible in principle.)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 22:16           ` Jann Horn
@ 2024-08-08 22:36             ` Andrii Nakryiko
  2024-08-08 23:26               ` Suren Baghdasaryan
  2024-08-09 15:20               ` Jann Horn
  0 siblings, 2 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-08-08 22:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, akpm, peterz, linux-kernel, linux-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko

On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > write-locked and mm is not modified from under us.
> > > > > > >
> > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > > ---
> > > > > >
> > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > multi-hour stress test run so far. Thanks!
> > > > >
> > > > > Thanks for testing and feel free to include this patch into your set.
> > > >
> > > > Will do!
> > > >
> > > > >
> > > > > I've been thinking about this some more and there is a very unlikely
> > > > > corner case if between mmap_lock_speculation_start() and
> > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > would generate a false positive, which would show up as if the
> > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > but they are not critical because a false result would simply lead to
> > > > > a retry under mmap_lock. However for your case this would be a
> > > > > critical issue. This is an extremely low probability scenario but
> > > > > should we still try to handle it?
> > > > >
> > > >
> > > > No, I think it's fine.
> > >
> > > Modern computers don't take *that* long to count to 2^32, even when
> > > every step involves one or more syscalls. I've seen bugs where, for
> > > example, a 32-bit refcount is not decremented where it should, making
> > > it possible to overflow the refcount with 2^32 operations of some
> > > kind, and those have taken something like 3 hours to trigger in one
> > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > you can create 2^32 legitimate references to an object and overflow a
> > > refcount that way
> > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > 32-bit counter - and that is with every step allocating memory).
> > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > writing to be faster than 25 minutes on a modern desktop machine.
> > >
> > > So for a reader of some kinda 32-bit sequence count, if it is
> > > conceivably possible for the reader to take at least maybe a couple
> > > minutes or so between the sequence count reads (also counting time
> > > during which the reader is preempted or something like that), there
> > > could be a problem. At that point in the analysis, if you wanted to
> > > know whether it's actually exploitable, I guess you'd have to look at
> > > what kinda context you're running in, and what kinda events can
> > > interrupt/preempt you (like whether someone can send a sufficiently
> > > dense flood of IPIs to completely prevent you making forward progress,
> > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > those things can delay you (maybe including what the pessimal
> > > scheduler behavior looks like if you're in preemptible context, or how
> > > long clock interrupts can take to execute when processing a giant pile
> > > of epoll watches), and so on...
> > >
> >
> > And here we are talking about *lockless* *speculative* VMA usage that
> > will last what, at most on the order of a few microseconds?
>
> Are you talking about time spent in task context, or time spent while
> the task is on the CPU (including time in interrupt context), or about
> wall clock time?

We are doing, roughly:

mmap_lock_speculation_start();
rcu_read_lock();
vma_lookup();
rb_find();
rcu_read_unlock();
mmap_lock_speculation_end();


On non-RT kernel this can be prolonged only by having an NMI somewhere
in the middle. On RT it can get preempted even within RCU locked
region, if I understand correctly. If you manage to make this part run
sufficiently long to overflow 31-bit counter, it's probably a bigger
problem than mmap's sequence wrapping over, no?

>
> https://www.vusec.net/projects/ghostrace/ is pretty amazing - when you
> look at the paper
> https://download.vusec.net/papers/ghostrace_sec24.pdf you can see in
> Figure 4 how they managed to turn a race window that's 8 instructions
> wide into a window they can stretch "indefinitely", and they didn't
> even have to reschedule to pull it off. If I understand correctly,
> they stretched the race window to something like 35 seconds and could
> have stretched it even wider if they had wanted to?
>
> (And yes, Linux fixed the specific trick they used for doing that, but
> it still shows that this kinda thing is possible in principle.)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 22:36             ` Andrii Nakryiko
@ 2024-08-08 23:26               ` Suren Baghdasaryan
  2024-08-09 15:20               ` Jann Horn
  1 sibling, 0 replies; 11+ messages in thread
From: Suren Baghdasaryan @ 2024-08-08 23:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jann Horn, akpm, peterz, linux-kernel, linux-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Thu, Aug 8, 2024 at 3:36 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > >
> > > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > > > ---
> > > > > > >
> > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > multi-hour stress test run so far. Thanks!
> > > > > >
> > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > >
> > > > > Will do!
> > > > >
> > > > > >
> > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > > would generate a false positive, which would show up as if the
> > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > but they are not critical because a false result would simply lead to
> > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > should we still try to handle it?
> > > > > >
> > > > >
> > > > > No, I think it's fine.
> > > >
> > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > example, a 32-bit refcount is not decremented where it should, making
> > > > it possible to overflow the refcount with 2^32 operations of some
> > > > kind, and those have taken something like 3 hours to trigger in one
> > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > you can create 2^32 legitimate references to an object and overflow a
> > > > refcount that way
> > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > 32-bit counter - and that is with every step allocating memory).
> > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > >
> > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > conceivably possible for the reader to take at least maybe a couple
> > > > minutes or so between the sequence count reads (also counting time
> > > > during which the reader is preempted or something like that), there
> > > > could be a problem. At that point in the analysis, if you wanted to
> > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > what kinda context you're running in, and what kinda events can
> > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > those things can delay you (maybe including what the pessimal
> > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > long clock interrupts can take to execute when processing a giant pile
> > > > of epoll watches), and so on...
> > > >
> > >
> > > And here we are talking about *lockless* *speculative* VMA usage that
> > > will last what, at most on the order of a few microseconds?
> >
> > Are you talking about time spent in task context, or time spent while
> > the task is on the CPU (including time in interrupt context), or about
> > wall clock time?
>
> We are doing, roughly:
>
> mmap_lock_speculation_start();
> rcu_read_lock();
> vma_lookup();
> rb_find();
> rcu_read_unlock();
> mmap_lock_speculation_end();
>
>
> On non-RT kernel this can be prolonged only by having an NMI somewhere
> in the middle. On RT it can get preempted even within RCU locked
> region, if I understand correctly. If you manage to make this part run
> sufficiently long to overflow 31-bit counter, it's probably a bigger
> problem than mmap's sequence wrapping over, no?

I was also thinking that an easy way to strengthen the guarantee this
overflow does not happen is to make mm->mm_lock_seq a 64-bit counter
while keeping vma->vm_lock_seq as is. When comparing them we can use
the lowest 32 bits of mm->mm_lock_seq without any loss of correctness.
Seems easy enough (famous last words). If we decide to do that I will
run performance tests to make sure performance does not suffer.

>
> >
> > https://www.vusec.net/projects/ghostrace/ is pretty amazing - when you
> > look at the paper
> > https://download.vusec.net/papers/ghostrace_sec24.pdf you can see in
> > Figure 4 how they managed to turn a race window that's 8 instructions
> > wide into a window they can stretch "indefinitely", and they didn't
> > even have to reschedule to pull it off. If I understand correctly,
> > they stretched the race window to something like 35 seconds and could
> > have stretched it even wider if they had wanted to?
> >
> > (And yes, Linux fixed the specific trick they used for doing that, but
> > it still shows that this kinda thing is possible in principle.)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-08 22:36             ` Andrii Nakryiko
  2024-08-08 23:26               ` Suren Baghdasaryan
@ 2024-08-09 15:20               ` Jann Horn
  2024-08-09 16:39                 ` Andrii Nakryiko
  1 sibling, 1 reply; 11+ messages in thread
From: Jann Horn @ 2024-08-09 15:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Suren Baghdasaryan, akpm, peterz, linux-kernel, linux-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko

On Fri, Aug 9, 2024 at 12:36 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > >
> > > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > > > ---
> > > > > > >
> > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > multi-hour stress test run so far. Thanks!
> > > > > >
> > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > >
> > > > > Will do!
> > > > >
> > > > > >
> > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > > would generate a false positive, which would show up as if the
> > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > but they are not critical because a false result would simply lead to
> > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > should we still try to handle it?
> > > > > >
> > > > >
> > > > > No, I think it's fine.
> > > >
> > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > example, a 32-bit refcount is not decremented where it should, making
> > > > it possible to overflow the refcount with 2^32 operations of some
> > > > kind, and those have taken something like 3 hours to trigger in one
> > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > you can create 2^32 legitimate references to an object and overflow a
> > > > refcount that way
> > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > 32-bit counter - and that is with every step allocating memory).
> > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > >
> > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > conceivably possible for the reader to take at least maybe a couple
> > > > minutes or so between the sequence count reads (also counting time
> > > > during which the reader is preempted or something like that), there
> > > > could be a problem. At that point in the analysis, if you wanted to
> > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > what kinda context you're running in, and what kinda events can
> > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > those things can delay you (maybe including what the pessimal
> > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > long clock interrupts can take to execute when processing a giant pile
> > > > of epoll watches), and so on...
> > > >
> > >
> > > And here we are talking about *lockless* *speculative* VMA usage that
> > > will last what, at most on the order of a few microseconds?
> >
> > Are you talking about time spent in task context, or time spent while
> > the task is on the CPU (including time in interrupt context), or about
> > wall clock time?
>
> We are doing, roughly:
>
> mmap_lock_speculation_start();
> rcu_read_lock();
> vma_lookup();
> rb_find();
> rcu_read_unlock();
> mmap_lock_speculation_end();
>
>
> On non-RT kernel this can be prolonged only by having an NMI somewhere
> in the middle.

I don't think you're running with interrupts off here? Even on kernels
without any preemption support, normal interrupts (like timers,
incoming network traffic, TLB flush IPIs) should still be able to
interrupt here. And in CONFIG_PREEMPT kernels (which enable
CONFIG_PREEMPT_RCU by default), rcu_read_lock() doesn't block
preemption, so you can even get preempted here - I don't think you
need RT for that.

My understanding is that the main difference between normal
CONFIG_PREEMPT and RT is whether spin_lock() blocks preemption.

> On RT it can get preempted even within RCU locked
> region, if I understand correctly. If you manage to make this part run
> sufficiently long to overflow 31-bit counter, it's probably a bigger
> problem than mmap's sequence wrapping over, no?

From the perspective of security, I don't consider it to be
particularly severe by itself if a local process can make the system
stall for very long amounts of time. And from the perspective of
reliability, I think scenarios where someone has to very explicitly go
out of their way to destabilize the system don't matter so much?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-09 15:20               ` Jann Horn
@ 2024-08-09 16:39                 ` Andrii Nakryiko
  2024-08-09 16:55                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-08-09 16:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, akpm, peterz, linux-kernel, linux-mm,
	Matthew Wilcox, Vlastimil Babka, Michal Hocko

On Fri, Aug 9, 2024 at 8:21 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Aug 9, 2024 at 12:36 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > >
> > > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > > > > ---
> > > > > > > >
> > > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > > multi-hour stress test run so far. Thanks!
> > > > > > >
> > > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > > >
> > > > > > Will do!
> > > > > >
> > > > > > >
> > > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > > > would generate a false positive, which would show up as if the
> > > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > > but they are not critical because a false result would simply lead to
> > > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > > should we still try to handle it?
> > > > > > >
> > > > > >
> > > > > > No, I think it's fine.
> > > > >
> > > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > > example, a 32-bit refcount is not decremented where it should, making
> > > > > it possible to overflow the refcount with 2^32 operations of some
> > > > > kind, and those have taken something like 3 hours to trigger in one
> > > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > > you can create 2^32 legitimate references to an object and overflow a
> > > > > refcount that way
> > > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > > 32-bit counter - and that is with every step allocating memory).
> > > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > > >
> > > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > > conceivably possible for the reader to take at least maybe a couple
> > > > > minutes or so between the sequence count reads (also counting time
> > > > > during which the reader is preempted or something like that), there
> > > > > could be a problem. At that point in the analysis, if you wanted to
> > > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > > what kinda context you're running in, and what kinda events can
> > > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > > those things can delay you (maybe including what the pessimal
> > > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > > long clock interrupts can take to execute when processing a giant pile
> > > > > of epoll watches), and so on...
> > > > >
> > > >
> > > > And here we are talking about *lockless* *speculative* VMA usage that
> > > > will last what, at most on the order of a few microseconds?
> > >
> > > Are you talking about time spent in task context, or time spent while
> > > the task is on the CPU (including time in interrupt context), or about
> > > wall clock time?
> >
> > We are doing, roughly:
> >
> > mmap_lock_speculation_start();
> > rcu_read_lock();
> > vma_lookup();
> > rb_find();
> > rcu_read_unlock();
> > mmap_lock_speculation_end();
> >
> >
> > On non-RT kernel this can be prolonged only by having an NMI somewhere
> > in the middle.
>
> I don't think you're running with interrupts off here? Even on kernels
> without any preemption support, normal interrupts (like timers,
> incoming network traffic, TLB flush IPIs) should still be able to
> interrupt here. And in CONFIG_PREEMPT kernels (which enable
> CONFIG_PREEMPT_RCU by default), rcu_read_lock() doesn't block
> preemption, so you can even get preempted here - I don't think you
> need RT for that.

Fair enough, normal interrupts can happen as well. Still, we are
talking about the above fast sequence running long enough (for
whatever reason) for the rest of the system to update mm (and not just
plan increment counters) for 2 billion times with mmap_write_lock() +
actual work + vma_end_write_all() logic. All kinds of bad things will
start happening before that: RCU stall warnings, lots of accumulated
memory waiting for RCU grace period, blocked threads on
synchronize_rcu(), etc.

>
> My understanding is that the main difference between normal
> CONFIG_PREEMPT and RT is whether spin_lock() blocks preemption.
>
> > On RT it can get preempted even within RCU locked
> > region, if I understand correctly. If you manage to make this part run
> > sufficiently long to overflow 31-bit counter, it's probably a bigger
> > problem than mmap's sequence wrapping over, no?
>
> From the perspective of security, I don't consider it to be
> particularly severe by itself if a local process can make the system
> stall for very long amounts of time. And from the perspective of
> reliability, I think scenarios where someone has to very explicitly go
> out of their way to destabilize the system don't matter so much?

So just to be clear. u64 counter is a no-brainer and I have nothing
against that. What I do worry about, though, is that this 64-bit
counter will be objected to due to it being potentially slower on
32-bit architectures. So I'd rather have
mmap_lock_speculation_{start,end}() with a 32-bit mm_lock_seq counter
than not have a way to speculate against VMA/mm at all.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
  2024-08-09 16:39                 ` Andrii Nakryiko
@ 2024-08-09 16:55                   ` Suren Baghdasaryan
  0 siblings, 0 replies; 11+ messages in thread
From: Suren Baghdasaryan @ 2024-08-09 16:55 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jann Horn, akpm, peterz, linux-kernel, linux-mm, Matthew Wilcox,
	Vlastimil Babka, Michal Hocko

On Fri, Aug 9, 2024 at 4:39 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Aug 9, 2024 at 8:21 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Aug 9, 2024 at 12:36 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > > > multi-hour stress test run so far. Thanks!
> > > > > > > >
> > > > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > > > >
> > > > > > > Will do!
> > > > > > >
> > > > > > > >
> > > > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > > > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > > > > would generate a false positive, which would show up as if the
> > > > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > > > but they are not critical because a false result would simply lead to
> > > > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > > > should we still try to handle it?
> > > > > > > >
> > > > > > >
> > > > > > > No, I think it's fine.
> > > > > >
> > > > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > > > example, a 32-bit refcount is not decremented where it should, making
> > > > > > it possible to overflow the refcount with 2^32 operations of some
> > > > > > kind, and those have taken something like 3 hours to trigger in one
> > > > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > > > you can create 2^32 legitimate references to an object and overflow a
> > > > > > refcount that way
> > > > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > > > 32-bit counter - and that is with every step allocating memory).
> > > > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > > > >
> > > > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > > > conceivably possible for the reader to take at least maybe a couple
> > > > > > minutes or so between the sequence count reads (also counting time
> > > > > > during which the reader is preempted or something like that), there
> > > > > > could be a problem. At that point in the analysis, if you wanted to
> > > > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > > > what kinda context you're running in, and what kinda events can
> > > > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > > > those things can delay you (maybe including what the pessimal
> > > > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > > > long clock interrupts can take to execute when processing a giant pile
> > > > > > of epoll watches), and so on...
> > > > > >
> > > > >
> > > > > And here we are talking about *lockless* *speculative* VMA usage that
> > > > > will last what, at most on the order of a few microseconds?
> > > >
> > > > Are you talking about time spent in task context, or time spent while
> > > > the task is on the CPU (including time in interrupt context), or about
> > > > wall clock time?
> > >
> > > We are doing, roughly:
> > >
> > > mmap_lock_speculation_start();
> > > rcu_read_lock();
> > > vma_lookup();
> > > rb_find();
> > > rcu_read_unlock();
> > > mmap_lock_speculation_end();
> > >
> > >
> > > On non-RT kernel this can be prolonged only by having an NMI somewhere
> > > in the middle.
> >
> > I don't think you're running with interrupts off here? Even on kernels
> > without any preemption support, normal interrupts (like timers,
> > incoming network traffic, TLB flush IPIs) should still be able to
> > interrupt here. And in CONFIG_PREEMPT kernels (which enable
> > CONFIG_PREEMPT_RCU by default), rcu_read_lock() doesn't block
> > preemption, so you can even get preempted here - I don't think you
> > need RT for that.
>
> Fair enough, normal interrupts can happen as well. Still, we are
> talking about the above fast sequence running long enough (for
> whatever reason) for the rest of the system to update mm (and not just
> plan increment counters) for 2 billion times with mmap_write_lock() +
> actual work + vma_end_write_all() logic. All kinds of bad things will
> start happening before that: RCU stall warnings, lots of accumulated
> memory waiting for RCU grace period, blocked threads on
> synchronize_rcu(), etc.
>
> >
> > My understanding is that the main difference between normal
> > CONFIG_PREEMPT and RT is whether spin_lock() blocks preemption.
> >
> > > On RT it can get preempted even within RCU locked
> > > region, if I understand correctly. If you manage to make this part run
> > > sufficiently long to overflow 31-bit counter, it's probably a bigger
> > > problem than mmap's sequence wrapping over, no?
> >
> > From the perspective of security, I don't consider it to be
> > particularly severe by itself if a local process can make the system
> > stall for very long amounts of time. And from the perspective of
> > reliability, I think scenarios where someone has to very explicitly go
> > out of their way to destabilize the system don't matter so much?
>
> So just to be clear. u64 counter is a no-brainer and I have nothing
> against that. What I do worry about, though, is that this 64-bit
> counter will be objected to due to it being potentially slower on
> 32-bit architectures. So I'd rather have
> mmap_lock_speculation_{start,end}() with a 32-bit mm_lock_seq counter
> than not have a way to speculate against VMA/mm at all.

IMHO the probability that the 32-bit counter will wrap around and end
up at exactly the same value out of 2^32 possible ones is so minuscule
that we could ignore that possibility.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-08-09 16:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240807182325.2585582-1-surenb@google.com>
2024-08-08 20:18 ` [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
2024-08-08 21:02   ` Suren Baghdasaryan
2024-08-08 21:11     ` Andrii Nakryiko
2024-08-08 21:42       ` Jann Horn
2024-08-08 22:05         ` Andrii Nakryiko
2024-08-08 22:16           ` Jann Horn
2024-08-08 22:36             ` Andrii Nakryiko
2024-08-08 23:26               ` Suren Baghdasaryan
2024-08-09 15:20               ` Jann Horn
2024-08-09 16:39                 ` Andrii Nakryiko
2024-08-09 16:55                   ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox