linux-mm.kvack.org archive mirror
From: Mateusz Guzik <mjguzik@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	 Christian Brauner <brauner@kernel.org>
Cc: kernel test robot <oliver.sang@intel.com>,
	oe-lkp@lists.linux.dev,  lkp@intel.com,
	Linux Memory Management List <linux-mm@kvack.org>,
	 linux-kernel@vger.kernel.org, ying.huang@intel.com,
	feng.tang@intel.com, fengwei.yin@intel.com
Subject: Re: [linux-next:master] [lockref]  d042dae6ad:  unixbench.throughput -33.7% regression
Date: Tue, 2 Jul 2024 14:10:32 +0200	[thread overview]
Message-ID: <vgg45kxk2fiyznm44w2saz3qjiwrjp3spvpswsl4ovd2jl5c5g@54dlbd4kdlh4> (raw)
In-Reply-To: <lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx>

On Tue, Jul 02, 2024 at 09:19:10AM +0200, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 10:41:35AM +0800, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed a -33.7% regression of unixbench.throughput on:
> > 
> > 
> > commit: d042dae6ad74df8a00ee8a3c6b77b53bc9e32f64 ("lockref: speculatively spin waiting for the lock to be released")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > 
> [..]
> > | testcase: change | stress-ng: stress-ng.getdent.ops_per_sec -56.5% regression                                |
> [..]
> > | testcase: change | stress-ng: stress-ng.handle.ops_per_sec -60.2% regression                                 |
> > | test machine     | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
> [..]
> 
> The good and the bad news is that this particular patch should be
> dropped (the struct dentry reordering should stay, though). There is
> some gnarly stuff here; please read the entire thing, even if it seems
> rambly at times.
> 
> To recap: there can be a microbenchmark setting with continuous get/put
> traffic where only sporadically someone takes the lock. The constant
> traffic paired with an immediate fallback to locking makes the
> primitives degrade to locked-only operation until the benchmark
> concludes. This is the problem the patch tried to address, as with the
> dentry reordering I found myself running into the degraded state by
> accident a lot, and there was worry the lkp machinery would too,
> disfiguring its results.
> 
> However, another microbenchmark setting can be mixing explicit spinlock
> acquire with lockref get/put. With high enough worker count the lock can
> be continuously held, making any spinning waiting for it to be released
> actively detrimental to performance. This is what the lkp folks ran
> into.
> 
> So I got a 2 socket * 48 cores * 2 threads machine to play with.
> 
> Both the unixbench and stress-ng --getdent testcases heavily mix in
> read-write locks and mutexes, while stress-ng --handle is all about
> spinlocks, with lockref accounting for about a third of it, so I only
> benchmarked that last one.
> 
> At a *low* scale of 24 workers on my usual test box, performance
> improves immensely (from ~165k ops/s to ~620k ops/s) as locking is
> avoided plenty of the time.
> 
> At 192 workers on the big box the stock kernel achieves ~134k ops/s,
> while the spinwait variant drops it to ~42k. As an experiment I added a
> dedicated waiting loop with a configurable spin count. Doing merely one
> spin drops the rate to ~124k ops/s.
> 
> I figured I would write the smallest patch which avoids locked-only
> degradation and call it a day -- for example, the arch_spin_is_contended
> macro makes it possible to check whether the lock is contended.
> 
> Then the loop could be (with lockref_locked/lockref_contended as
> illustrative helpers):
> 	if (lockref_locked(old)) {
> 		if (lockref_contended(old))
> 			break;
> 		cpu_relax();
> 		old.lock_count = READ_ONCE(lockref->lock_count);
> 		continue;
> 	}
> 
> 
> Note how this avoids any additional accesses to the cacheline if the
> lock is under traffic. While this would avoid degrading in certain
> trivial cases, it very much can still result in lockref get/put being in
> the picture *and* only taking locks, so I'm not particularly happy with
> it.
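
To make that concrete: here is a standalone userspace sketch of the idea, with C11 atomics standing in for the kernel primitives. The toy_lockref layout, the bit assignments and the helper names are all made up for illustration; this is not the actual lockref code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lockref word: bit 0 = lock held, bit 1 = lock contended,
 * the remaining bits hold the refcount. */
#define LR_LOCKED     0x1u
#define LR_CONTENDED  0x2u
#define LR_COUNT_UNIT 0x4u

typedef struct { _Atomic uint64_t lock_count; } toy_lockref;

/* Bump the refcount without taking the lock; returns false when the
 * caller should fall back to the locked slowpath. */
static bool toy_lockref_get(toy_lockref *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count,
					    memory_order_relaxed);
	for (;;) {
		if (old & LR_LOCKED) {
			/* Lock is under traffic: give up right away
			 * instead of hammering the cacheline further. */
			if (old & LR_CONTENDED)
				return false;
			/* cpu_relax() would go here in kernel code */
			old = atomic_load_explicit(&lr->lock_count,
						   memory_order_relaxed);
			continue;
		}
		/* Lock is free: one cmpxchg bumps the count. */
		if (atomic_compare_exchange_weak(&lr->lock_count, &old,
						 old + LR_COUNT_UNIT))
			return true;
	}
}
```

The point of the contended-bit test is that the failure path issues no extra write to the line; the cmpxchg is only attempted once the lock is observed free.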
> 
> This is where I found that going to 60 workers issuing open or stat on
> the same dentry *also* permanently degrades, for example:
> 
> # ./stat2_processes -t 60
> testcase:Same file stat
> warmup
> min:11293 max:25058 total:1164504
> min:10706 max:12977 total:679839
> min:10303 max:12222 total:643437
> min:8722 max:9846 total:542022
> min:8514 max:9554 total:529731
> min:8349 max:9296 total:519381
> measurement
> min:8203 max:9079 total:510768
> min:8122 max:8913 total:504069
> 
> According to perf top it is all contention on the spinlock, with the
> lockref routines taking it. So something sneaks in a lock/unlock cycle
> and there is enough traffic that the patch's default of 100 spins is
> not sufficient to wait it out. With the configurable spin count I found
> this can stay lockless if I make it wait 1000 spins. Getting the spin
> count "wrong" reduces performance further -- for example, a limit of
> 700 spins was not enough to wait out the lock holders, and throughput
> went down to ~269k ops/s due to time wasted spinning, merely a little
> more than half of the degraded state above.
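
For reference, the configurable-budget waiting loop amounts to something like the following userspace sketch (invented names again; the kernel version would use READ_ONCE() and cpu_relax()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPIN_LOCKED 0x1u

/* Spin up to max_spins times waiting for the lock bit to clear.
 * On success, *out holds the last observed (unlocked) value; on
 * failure the caller falls back to taking the lock outright. */
static bool spinwait_unlocked(_Atomic uint64_t *word, int max_spins,
			      uint64_t *out)
{
	for (int i = 0; i < max_spins; i++) {
		uint64_t v = atomic_load_explicit(word,
						  memory_order_relaxed);
		if (!(v & SPIN_LOCKED)) {
			*out = v;
			return true;
		}
		/* cpu_relax() equivalent would go here */
	}
	return false;
}
```

This makes the tuning problem obvious: max_spins has to be large enough to outlast the hold times actually seen, and any budget spent before giving up is pure overhead on top of the locked slowpath.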
> 
> That is to say messing with lockref itself is probably best avoided as
> getting a win depends on being at the right (low) scale or getting
> "lucky".
> 
> My worry about locked-only degradation showing up was unnecessary in
> that, at the scales used by the lkp folks, it already occurs.
> 
> All in all the thing to do is to find out who takes the spin lock and if
> it can be avoided. Perhaps this can be further split so that only spots
> depending on the refcount take the d_lock (for example in the --handle
> test the lock is taken to iterate the alias list -- it would not mind
> refs going up or down).
> 
> It may be (and I'm really just handwaving here) that dentries should
> move away from lockref altogether in favor of storing flags in the
> refcount. Even if both get and put still had to cmpxchg, they would
> never have to concern themselves with locking of any sort, except when
> the dentry is going down. The unref side could probably even be a mere
> lock xadd. I recognize this would convert some regular ops in other
> parts of the kernel into atomics, which may be undesirable. Something
> to ponder nonetheless.
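
For the record, the shape I have in mind is roughly the sketch below (pure speculation on my part, userspace C11 atomics and made-up names): the get side cmpxchg-loops and refuses a dying dentry, while the put side is the single atomic subtraction, i.e. the lock xadd mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define D_DYING    0x1u		/* hypothetical "going down" flag */
#define D_REF_UNIT 0x2u		/* refcount lives above the flag bit */

typedef struct { _Atomic uint32_t refflags; } toy_dentry;

/* Grab a reference; fails (-> slowpath) once the dentry is dying. */
static bool toy_dget(toy_dentry *d)
{
	uint32_t old = atomic_load_explicit(&d->refflags,
					    memory_order_relaxed);
	do {
		if (old & D_DYING)
			return false;
	} while (!atomic_compare_exchange_weak(&d->refflags, &old,
					       old + D_REF_UNIT));
	return true;
}

/* Drop a reference with a single atomic sub (lock xadd on x86);
 * returns true when this was the last reference. */
static bool toy_dput(toy_dentry *d)
{
	uint32_t old = atomic_fetch_sub_explicit(&d->refflags, D_REF_UNIT,
						 memory_order_release);
	return (old >> 1) == 1;
}
```

Note that neither path ever looks at a lock; teardown would be the only place that has to set D_DYING and arbitrate with concurrent getters.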
> 
> At the end of the day:
> 1. this patch should be dropped; I don't think there is a good
> replacement either
> 2. the real fix would do something about the dentry lock itself or its
> consumers (maybe the lock is taken way more often than it should be?)

Well there is also the option of going full RCU in the fast path, which
I mentioned last time around lockref.

This would be applicable at least to the stat, statx, readlink and
access syscalls.

It would not help the above bench though, but then again I don't think
it is doing anything realistic.

I'm going to poke around it when I find the time.

Maybe ditching lockref would be a good idea regardless of whether the
above pans out.

