From: Marcelo Tosatti <mtosatti@redhat.com>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>,
Leonardo Bras <leobras.c@gmail.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Vlastimil Babka <vbabka@suse.cz>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Leonardo Bras <leobras@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
Frederic Weisbecker <fweisbecker@suse.de>
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
Date: Tue, 24 Feb 2026 14:23:25 -0300
Message-ID: <aZ3ejedS7nE5mnva@tpad>
In-Reply-To: <aZzM_44L1vKzcOCy@pavilion.home>
On Mon, Feb 23, 2026 at 10:56:15PM +0100, Frederic Weisbecker wrote:
> > On Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko wrote:
> > On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > > Michal,
> > >
> > > Again, i don't see how moving operations to happen at return to
> > > kernel would help (assuming you are talking about
> > > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> >
> > Nope, I am not talking about IPIs, although those are an example of pcp
> > state as well. I am sorry I do not have a link handy, I am pretty sure
> > Frederic will have that. Another example, though, was vmstat flushes
> > that need to be pcp. There are many other examples.
>
> Here it is:
>
> https://lore.kernel.org/all/20250410152327.24504-1-frederic@kernel.org/
>
> Thanks.
Frederic,

I think this is a valid solution; however, on systems with many CPUs in
nohz_full mode that perform system calls, can't there be a significant
increase in lru_lock contention? Consider 100+ CPUs each performing many
system calls that add 1 or 2 folios to per-CPU LRU lists.

Note: if you are confident the above is not a problem, this approach
looks good to me.
commit eb709b0d062efd653a61183af8e27b2711c3cf5c
Author: Shaohua Li <shaohua.li@intel.com>
Date: Tue May 24 17:12:55 2011 -0700
mm: batch activate_page() to reduce lock contention
The zone->lru_lock is heavily contended in workloads where activate_page()
is frequently used.  We can batch the activate_page() calls to reduce the
lock contention.  The batched pages will be added to the zone list when
the pool is full or when page reclaim tries to drain them.

For example, on a 4-socket, 64-CPU system, create a sparse file and 64
processes that shared-map the file.  Each process reads the whole file
and then exits.  The process exit does unmap_vmas() and causes a lot of
activate_page() calls.  In such a workload, we saw about a 58% total
time reduction with the patch below.  Other workloads with many
activate_page() calls benefit as well.
...
The most significant are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%
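The batching idea in commit eb709b0d062e can be sketched in plain
userspace C. This is a simplified illustration, not the kernel code:
page_batch stands in for the kernel's pagevec, a pthread mutex stands in
for zone->lru_lock, and the counters exist only to make the effect
visible.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Userspace sketch of the batching idea: instead of taking the shared
 * LRU lock once per page, pages are staged in a small per-CPU (here:
 * per-thread) batch, and the lock is taken once per PAGEVEC_SIZE pages
 * when the batch drains.  All names are illustrative, not the kernel's.
 */

#define PAGEVEC_SIZE 15

struct page_batch {
	void *pages[PAGEVEC_SIZE];
	int nr;
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_acquisitions;	/* lock round-trips taken */
static unsigned long activated;		/* pages moved to the active list */

/* Drain the whole batch under a single lock acquisition. */
static void batch_drain(struct page_batch *b)
{
	pthread_mutex_lock(&lru_lock);
	lock_acquisitions++;
	activated += b->nr;		/* stands in for list splicing */
	b->nr = 0;
	pthread_mutex_unlock(&lru_lock);
}

/* Stage a page; only touch lru_lock when the batch is full. */
static void activate_page_batched(struct page_batch *b, void *page)
{
	b->pages[b->nr++] = page;
	if (b->nr == PAGEVEC_SIZE)
		batch_drain(b);
}
```

Activating 150 pages this way takes lru_lock 10 times instead of 150;
the question above is whether, with 100+ nohz_full CPUs each draining
such batches around system calls, even the batched acquisitions become
significant.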
Some Gemini answers (the question was "list of nohz_full use cases"):
2. Scientific Simulation & Research
Research institutions (like CERN, NASA, or national labs) use nohz_full
for "tightly coupled" parallel workloads.
Workloads: Molecular dynamics, fluid dynamics (CFD), and weather forecasting (e.g., WRF models).
The "Barrier" Problem: In massive clusters using MPI (Message Passing
Interface), all CPUs often have to reach a synchronization barrier
before the next step of a simulation. If one CPU is delayed by a few
milliseconds due to a timer tick, all other thousands of CPUs sit idle
waiting for it. nohz_full prevents this "tail latency" from stalling the
entire supercomputer.
...
4. Competitive Benchmarking & Kernel Development
Performance engineers use this mode to get "clean" numbers when testing
new hardware or compilers.
Workloads: Core-to-core latency tests, cache-bandwidth benchmarks, and
standard suites like SPEC CPU.
Goal: Eliminating the "noise" of the operating system so that the
results reflect pure hardware performance.
...
Summary table: who uses nohz_full?

User group        Primary workload          Why they use it
----------        ----------------          ---------------
Quant firms       High-frequency trading    Prevent micro-stutter during trade execution.
Research labs     MPI-based simulations     Avoid the "slowest node" stalling the whole cluster.
Telcos/ISPs       5G/packet processing      Ensure wire-speed processing without interrupts.
Hardware vendors  Chip validation           Benchmark CPU performance without OS interference.
Here is how scientific simulations handle system calls:
1. The "Compute-Loop" (Low Syscall)
The core of a simulation (like a GROMACS molecular dynamics step) is
just raw math: fetching data from RAM, doing floating-point arithmetic
(AVX/SSE), and writing it back.
During the loop: The CPU stays in "Userspace" for millions of cycles
without ever asking the kernel for help.
Why it works: Since there are no system calls, nohz_full can
successfully turn off the timer tick, allowing the CPU to focus 100% on
the math.
2. The "Communication-Phase" (High Syscall)
System calls usually happen only at the end of a computation block, when
the simulation needs to talk to other nodes.
The Tools: MPI (Message Passing Interface) uses system calls like write,
sendmsg, or specialized RDMA calls to move data across the network.
The Pattern: These simulations follow a "Burst" pattern—long periods
of zero system calls (computation) followed by a quick burst of system
calls (synchronization).
3. When are they "syscall intensive"?

Specific parts of a simulation are syscall-intensive, but researchers
try to minimize them:
I/O Operations: Writing "checkpoints" or large trajectory files to disk
(using write()). This is why high-end HPC systems use Asynchronous I/O
or dedicated I/O nodes—to keep the compute cores from getting bogged
down in system calls.
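The dedicated-I/O pattern can be sketched with a small hand-off queue:
compute threads enqueue checkpoint buffers and return immediately, and a
single writer thread performs the actual I/O. All names below are
invented for illustration, and fprintf() to stderr stands in for the
real write() to a checkpoint file.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QDEPTH 8

struct ckpt_queue {
	const char *bufs[QDEPTH];
	int head, tail, done;
	unsigned long written;		/* checkpoints drained so far */
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

static void queue_init(struct ckpt_queue *q)
{
	memset(q, 0, sizeof(*q));
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->cond, NULL);
}

/* Compute side: enqueue and return immediately; no I/O here.
 * (For brevity this sketch assumes the producer never overruns QDEPTH.) */
static void ckpt_submit(struct ckpt_queue *q, const char *buf)
{
	pthread_mutex_lock(&q->lock);
	q->bufs[q->tail++ % QDEPTH] = buf;
	pthread_cond_signal(&q->cond);
	pthread_mutex_unlock(&q->lock);
}

/* Tell the writer that no more checkpoints are coming. */
static void ckpt_finish(struct ckpt_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->done = 1;
	pthread_cond_broadcast(&q->cond);
	pthread_mutex_unlock(&q->lock);
}

/* I/O side: a dedicated thread drains the queue and does the syscalls. */
static void *writer_thread(void *arg)
{
	struct ckpt_queue *q = arg;

	pthread_mutex_lock(&q->lock);
	for (;;) {
		while (q->head == q->tail && !q->done)
			pthread_cond_wait(&q->cond, &q->lock);
		if (q->head == q->tail && q->done)
			break;
		const char *buf = q->bufs[q->head++ % QDEPTH];
		pthread_mutex_unlock(&q->lock);
		fprintf(stderr, "checkpoint: %s\n", buf);
		pthread_mutex_lock(&q->lock);
		q->written++;
	}
	pthread_mutex_unlock(&q->lock);
	return NULL;
}
```

The compute thread's cost per checkpoint is one mutex round-trip rather
than a blocking write(), which is the point of the pattern.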
Memory Allocation: Constantly calling malloc/free involves the brk or
mmap system calls. Optimized simulation tools pre-allocate all the
memory they need at startup to avoid this.