From: Mateusz Guzik <mjguzik@gmail.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
Mateusz Guzik <mjguzik@gmail.com>,
Suren Baghdasaryan <surenb@google.com>,
linux-mm <linux-mm@kvack.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Matthew Wilcox <willy@infradead.org>
Subject: Re: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
Date: Mon, 31 Mar 2025 22:27:55 +0200 [thread overview]
Message-ID: <CAGudoHF1JkD3Zo0Md_5_Ea8xpqpbzbJc_r3_335AF4nW1s3t8w@mail.gmail.com> (raw)
In-Reply-To: <4v25j6kbo4flzy5zbubf2w6bgesz2juadevokqlg2ugbe3im3h@h37mcss2aclt>
On Mon, Mar 31, 2025 at 9:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Reading the commit did trigger my memory about the 5% regression, but I
> disregarded it as it seems so unlikely in real life that the benefits
> outweighed the upper bounds of 5% negative.
>
> The 5,000 forks of 10,000 vmas benchmark would also be at least a bit
> faster with Suren's more recent rcu safe vma change [1].
>
Well, it follows it would be more than 5% now? ;)
> Although, that doesn't rule out changing this to get higher number, I
> still don't think the complexity of the change is worth it for a
> contrived benchmark.
>
I assumed this would be easy enough to reason through for someone
familiar with mm (which I'm not).
If this is deemed too hairy, then I completely agree it should not be
pursued (at least for now, see below).
> Reading Documentation/mm/active_mm.rst makes me uncertain of the
> mm_count == 1 and mm_users == 1 test. Since mm_count is the number of
> 'lazy' users, and it may no longer include the lazy users..
>
The doc states interested callers need to use mmgrab_lazy_tlb() and
mmdrop_lazy_tlb() and that the *exact* count won't be known.
This is perfectly fine for my proposal: we don't care specifically
how many other users are out there (lazy tlb or otherwise); what we do
need to know is whether there is at least one, and that much we are
being told in the expected manner: punting the mm off to lazy tlb also
grabs a ->mm_count reference on it.
To be more exact, mmgrab_lazy_tlb() starts with pinning the struct
while the mmap semaphore is held. This synchronizes ->mm_count
visibility against dup_mmap() which write-locks it.
Suppose a process with two threads races with one issuing dup_mmap()
and the other one exiting. Further suppose the forking thread got the
lock after the thread executing exit_mmap() finished and the mm
remains in active use for lazy tlb. Per my previous e-mail it is an
invariant that in this case ->mm_count is at least two:
1. the forking thread uses the mm, which implies mm_users > 0, which
implies mm_count of at least one.
2. employing the mm for lazy tlb use bumped the count, which means it
has to be at least two, regardless of how many lazy uses managed to
elide refcount manipulation afterwards
But the count being two disables the optimization. Also note that
since the forking thread got the lock first, ->mm_count cannot
transition 1 -> 2 on account of lazy tlb use, as in the worst case the
exiting thread is still waiting for the lock (besides, a waiting
thread implies ->mm_users of at least two, which also disables the
optimization).
> Since there are no real workloads that suffer, is this worth it?
>
It is unclear to me if you are criticizing this idea specifically
(given your doubts of whether it even works) or the notion of this
kind of optimization to begin with.
And this brings me back to kernel build, which I'm looking at for kicks.
There is a lot of evil going on there from a userspace PoV, which
partially obfuscates kernel-side slowdowns, of which there are plenty
(and I whacked some of them on the VFS side, with more in the
pipeline). I don't expect dodging vma locking to be worth much in
isolation in a real workload. I do expect consistent effort to provide
smooth execution in commonly used code paths (fork being one) to add up.
Apart from being algorithmically sound in its own right, smooth
execution means no stalls which can be easily avoided (notably cache
misses and lock-prefixed ops).
To give you a specific example, compilation of C files run by GNU make
goes through the shell -- as in an extra fork + exec trip on top of
spawning the compiler. In order to model this, with focus on the
kernel side, I slapped a system("gcc -c -o /tmp/crap.o src.c")
testcase into will-it-scale (src.c is hello world with one #include
clause, important for later). The use of system() suffers the shell
trip in the same manner.
I'm seeing about 20% system time, about a third of which looks
suspicious at best. For example over 1% of the total (or over 4% of
kernel time) is spent in sync_regs() copying 168 bytes using rep
movsq. Convincing the compiler to use regular stores instead dropped
the routine to about 0.1% and gave me a measurable boost in throughput
of these compiles. Will this speed up compilation of a non-toy file?
In isolation, no. I do however claim that:
1. consistent effort to whack avoidable slowdowns adds up, you don't
need to do anything revolutionary (albeit that's welcome)
2. the kernel is chock full of slowdowns which don't need to be there,
but I'm not going to rant about specific examples
3. the famed real workloads are shooting themselves in their feet on
the regular, making it really hard to show improvement on the kernel
side, apart from massive changes or fixing utter breakage
So, for the vma lock avoidance idea, it may be that it requires too
much analysis to justify it. If so, screw it. I do stand by what I
tried to accomplish though (whacking atomics in the fast path in the
common case).
--
Mateusz Guzik <mjguzik gmail.com>