linux-mm.kvack.org archive mirror
* [PATCH v3 0/2] execve scalability issues, part 1
@ 2023-08-23  5:06 Mateusz Guzik
  2023-08-23  5:06 ` [PATCH v3 1/2] pcpcntr: add group allocation/free Mateusz Guzik
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Mateusz Guzik @ 2023-08-23  5:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: dennis, tj, cl, akpm, shakeelb, vegard.nossum, linux-mm, Mateusz Guzik

To start, I figured I would benchmark about as friendly a case as it
gets -- statically linked *separate* binaries, all doing execve in a loop.

I borrowed the bench from here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

It prints a result every second.

My test box temporarily has only 26 cores, and even at that scale I run
into massive lock contention stemming from back-to-back calls to
percpu_counter_init (and percpu_counter_destroy later).

While not a panacea, one simple thing to do here is to batch these ops.
Since the term "batching" is already used in the file, I decided to
refer to it as "grouping" instead.

Even if this code could be patched to dodge these counters, I would
argue that a high-traffic alloc/free consumer is only a matter of time,
so it makes sense to facilitate one.

With the fix I get an ok win, to quote from the commit:
> Even at a very modest scale of 26 cores (ops/s):
> before: 133543.63
> after:  186061.81 (+39%)

Even with the patch these allocations remain a significant problem, but
the primary bottleneck shifts to:

    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irqsave+57
    folio_lruvec_lock_irqsave+91
    release_pages+590
    tlb_batch_pages_flush+61
    tlb_finish_mmu+101
    exit_mmap+327
    __mmput+61
    begin_new_exec+1245
    load_elf_binary+712
    bprm_execve+644
    do_execveat_common.isra.0+429
    __x64_sys_execve+50
    do_syscall_64+46
    entry_SYSCALL_64_after_hwframe+110

I intend to do more work in this area to mostly sort it out, but I would
not mind if someone else took the hammer to folio. :)

With this out of the way I'll be looking at some form of caching to
eliminate these allocs as a problem.

v3:
- fix !CONFIG_SMP build
- drop the backtrace from fork commit message

v2:
- force bigger alignment on alloc
- rename "counters" to "nr_counters" and pass prior to lock key
- drop {}'s for single-statement loops


Mateusz Guzik (2):
  pcpcntr: add group allocation/free
  fork: group allocation of per-cpu counters for mm struct

 include/linux/percpu_counter.h | 39 ++++++++++++++++++----
 kernel/fork.c                  | 14 ++------
 lib/percpu_counter.c           | 61 +++++++++++++++++++++++-----------
 3 files changed, 77 insertions(+), 37 deletions(-)

-- 
2.41.0




Thread overview: 8+ messages
2023-08-23  5:06 [PATCH v3 0/2] execve scalability issues, part 1 Mateusz Guzik
2023-08-23  5:06 ` [PATCH v3 1/2] pcpcntr: add group allocation/free Mateusz Guzik
2023-08-24  6:26   ` Dennis Zhou
2023-08-24 10:01     ` Vegard Nossum
2023-08-23  5:06 ` [PATCH v3 2/2] kernel/fork: group allocation/free of per-cpu counters for mm struct Mateusz Guzik
2023-08-24  6:28   ` Dennis Zhou
2023-09-06  8:25   ` kernel test robot
2023-08-25 15:14 ` [PATCH v3 0/2] execve scalability issues, part 1 Dennis Zhou
