From: Suren Baghdasaryan <surenb@google.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Jann Horn <jannh@google.com>,
akpm@linux-foundation.org, peterz@infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Matthew Wilcox <willy@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>
Subject: Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}
Date: Fri, 9 Aug 2024 16:55:17 +0000
Message-ID: <CAJuCfpHquwbc2768MjOp8vUT7fSaV=xd+pBn4fPOUHBHJdNxnA@mail.gmail.com>
In-Reply-To: <CAEf4BzarYwnt-MT+1icXeTVdk0gLUmXuJtV_5dA9=a8CWE3=Tg@mail.gmail.com>

On Fri, Aug 9, 2024 at 4:39 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Aug 9, 2024 at 8:21 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Aug 9, 2024 at 12:36 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@google.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Add helper functions to speculatively perform operations without
> > > > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > > > > > > write-locked and mm is not modified from under us.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > This change makes sense and makes mm's seq a bit more useful and
> > > > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > > > > > > it seems to work great, I haven't run into any problems with a
> > > > > > > > > multi-hour stress test run so far. Thanks!
> > > > > > > >
> > > > > > > > Thanks for testing and feel free to include this patch into your set.
> > > > > > >
> > > > > > > Will do!
> > > > > > >
> > > > > > > >
> > > > > > > > I've been thinking about this some more and there is a very unlikely
> > > > > > > > corner case if between mmap_lock_speculation_start() and
> > > > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > > > > > > > times that mm->mm_lock_seq (int) overflows and just happens to reach
> > > > > > > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > > > > > > would generate a false positive, which would show up as if the
> > > > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > > > > > > but they are not critical because a false result would simply lead to
> > > > > > > > a retry under mmap_lock. However for your case this would be a
> > > > > > > > critical issue. This is an extremely low probability scenario but
> > > > > > > > should we still try to handle it?
> > > > > > > >
> > > > > > >
> > > > > > > No, I think it's fine.
> > > > > >
> > > > > > Modern computers don't take *that* long to count to 2^32, even when
> > > > > > every step involves one or more syscalls. I've seen bugs where, for
> > > > > > example, a 32-bit refcount is not decremented where it should, making
> > > > > > it possible to overflow the refcount with 2^32 operations of some
> > > > > > kind, and those have taken something like 3 hours to trigger in one
> > > > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> > > > > > 14 hours in another case. Or even cases where, if you have enough RAM,
> > > > > > you can create 2^32 legitimate references to an object and overflow a
> > > > > > refcount that way
> > > > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> > > > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> > > > > > 32-bit counter - and that is with every step allocating memory).
> > > > > > So I'd expect 2^32 simple operations that take the mmap lock for
> > > > > > writing to be faster than 25 minutes on a modern desktop machine.
> > > > > >
> > > > > > So for a reader of some kinda 32-bit sequence count, if it is
> > > > > > conceivably possible for the reader to take at least maybe a couple
> > > > > > minutes or so between the sequence count reads (also counting time
> > > > > > during which the reader is preempted or something like that), there
> > > > > > could be a problem. At that point in the analysis, if you wanted to
> > > > > > know whether it's actually exploitable, I guess you'd have to look at
> > > > > > what kinda context you're running in, and what kinda events can
> > > > > > interrupt/preempt you (like whether someone can send a sufficiently
> > > > > > dense flood of IPIs to completely prevent you making forward progress,
> > > > > > like in https://www.vusec.net/projects/ghostrace/), and for how long
> > > > > > those things can delay you (maybe including what the pessimal
> > > > > > scheduler behavior looks like if you're in preemptible context, or how
> > > > > > long clock interrupts can take to execute when processing a giant pile
> > > > > > of epoll watches), and so on...
> > > > > >
> > > > >
> > > > > And here we are talking about *lockless* *speculative* VMA usage that
> > > > > will last what, at most on the order of a few microseconds?
> > > >
> > > > Are you talking about time spent in task context, or time spent while
> > > > the task is on the CPU (including time in interrupt context), or about
> > > > wall clock time?
> > >
> > > We are doing, roughly:
> > >
> > > mmap_lock_speculation_start();
> > > rcu_read_lock();
> > > vma_lookup();
> > > rb_find();
> > > rcu_read_unlock();
> > > mmap_lock_speculation_end();
> > >
> > >
> > > > On a non-RT kernel this can be prolonged only by having an NMI somewhere
> > > > in the middle.
> >
> > I don't think you're running with interrupts off here? Even on kernels
> > without any preemption support, normal interrupts (like timers,
> > incoming network traffic, TLB flush IPIs) should still be able to
> > interrupt here. And in CONFIG_PREEMPT kernels (which enable
> > CONFIG_PREEMPT_RCU by default), rcu_read_lock() doesn't block
> > preemption, so you can even get preempted here - I don't think you
> > need RT for that.
>
> Fair enough, normal interrupts can happen as well. Still, we are
> talking about the above fast sequence running long enough (for
> whatever reason) for the rest of the system to update mm (and not just
> plainly increment counters) 2 billion times with mmap_write_lock() +
> actual work + vma_end_write_all() logic. All kinds of bad things will
> start happening before that: RCU stall warnings, lots of accumulated
> memory waiting for RCU grace period, blocked threads on
> synchronize_rcu(), etc.
>
> >
> > My understanding is that the main difference between normal
> > CONFIG_PREEMPT and RT is whether spin_lock() blocks preemption.
> >
> > > On RT it can get preempted even within an RCU-locked
> > > region, if I understand correctly. If you manage to make this part run
> > > sufficiently long to overflow a 31-bit counter, it's probably a bigger
> > > problem than mmap's sequence wrapping around, no?
> >
> > From the perspective of security, I don't consider it to be
> > particularly severe by itself if a local process can make the system
> > stall for very long amounts of time. And from the perspective of
> > reliability, I think scenarios where someone has to very explicitly go
> > out of their way to destabilize the system don't matter so much?
>
> So just to be clear: a u64 counter is a no-brainer and I have nothing
> against that. What I do worry about, though, is that this 64-bit
> counter will be objected to due to it being potentially slower on
> 32-bit architectures. So I'd rather have
> mmap_lock_speculation_{start,end}() with a 32-bit mm_lock_seq counter
> than not have a way to speculate against VMA/mm at all.

IMHO the probability that the 32-bit counter will wrap around and end
up at exactly the same value out of 2^32 possible ones is so minuscule
that we could ignore that possibility.
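
For reference, the wrap-around case discussed above can be modeled in
userspace roughly as follows. This is only a sketch under stated
assumptions, not the patch itself: struct model_mm and all model_*()
names are hypothetical, and the model assumes the counter is bumped once
on write-lock and once on write-unlock, so that it is odd while
"write-locked" and a full 32-bit wrap takes 2^31 lock/unlock cycles
(the "2 billion" mentioned above).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; this is NOT the kernel's struct mm_struct. */
struct model_mm {
	_Atomic uint32_t lock_seq; /* assumed: odd while write-locked, even otherwise */
};

/* Writer side: one increment when taking the lock, one when dropping it. */
static void model_write_lock(struct model_mm *mm)
{
	atomic_fetch_add_explicit(&mm->lock_seq, 1, memory_order_release);
}

static void model_write_unlock(struct model_mm *mm)
{
	atomic_fetch_add_explicit(&mm->lock_seq, 1, memory_order_release);
}

/* Reader side: sample the counter; refuse to speculate if a writer is active. */
static bool model_speculation_start(struct model_mm *mm, uint32_t *seq)
{
	*seq = atomic_load_explicit(&mm->lock_seq, memory_order_acquire);
	return (*seq & 1) == 0;
}

/* Reader side: speculation is considered valid only if the counter is unchanged. */
static bool model_speculation_end(struct model_mm *mm, uint32_t seq)
{
	atomic_thread_fence(memory_order_acquire);
	return seq == atomic_load_explicit(&mm->lock_seq, memory_order_relaxed);
}

/* Apply the net effect of n complete lock/unlock cycles without looping. */
static void model_writer_cycles(struct model_mm *mm, uint64_t n)
{
	/* Each cycle adds 2; the cast performs the 32-bit wrap explicitly. */
	atomic_fetch_add_explicit(&mm->lock_seq, (uint32_t)(2 * n),
				  memory_order_release);
}

/* Returns true if a reader misses exactly 2^31 writer cycles. */
static bool model_demo_false_positive(void)
{
	struct model_mm mm;
	uint32_t seq;

	atomic_init(&mm.lock_seq, 0);
	if (!model_speculation_start(&mm, &seq))
		return false;
	model_writer_cycles(&mm, UINT64_C(1) << 31); /* counter wraps to its old value */
	return model_speculation_end(&mm, seq);
}
```

In this model a single writer cycle between start and end is detected,
while exactly 2^31 cycles are not, which is the false positive in
question. A 64-bit counter would make the same wrap require 2^63 cycles.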