From: Frederic Weisbecker <frederic@kernel.org>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>,
Leonardo Bras <leobras.c@gmail.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Vlastimil Babka <vbabka@suse.cz>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Leonardo Bras <leobras@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
Frederic Weisbecker <fweisbecker@suse.de>
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
Date: Wed, 25 Feb 2026 22:49:54 +0100
Message-ID: <aZ9ugjKvb4U7_R93@pavilion.home>
In-Reply-To: <aZ3ejedS7nE5mnva@tpad>
On Tue, Feb 24, 2026 at 02:23:25PM -0300, Marcelo Tosatti wrote:
> On Mon, Feb 23, 2026 at 10:56:15PM +0100, Frederic Weisbecker wrote:
> > On Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko wrote:
> > > On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > > > Michal,
> > > >
> > > > Again, I don't see how moving operations to happen at return to
> > > > kernel would help (assuming you are talking about
> > > > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> > >
> > > Nope, I am not talking about IPIs, although those are an example of pcp
> > > state as well. I am sorry I do not have a link handy, I am pretty sure
> > > Frederic will have that. Another example, though, was vmstat flushes
> > > that need to be pcp. There are many other examples.
> >
> > Here it is:
> >
> > https://lore.kernel.org/all/20250410152327.24504-1-frederic@kernel.org/
> >
> > Thanks.
>
> Frederic,
>
> I think this is a valid solution. However, on systems with many CPUs in
> nohz_full mode that perform many system calls, can't there be a significant
> increase in lru_lock contention? Consider 100+ CPUs performing many system
> calls, each adding 1 or 2 folios to per-CPU LRU lists.
That's more a question for Michal or Vlastimil.
>
> Note: if you are confident about the above not being a problem,
> this approach looks good to me.
>
> commit eb709b0d062efd653a61183af8e27b2711c3cf5c
> Author: Shaohua Li <shaohua.li@intel.com>
> Date: Tue May 24 17:12:55 2011 -0700
>
> mm: batch activate_page() to reduce lock contention
>
> The zone->lru_lock is heavily contended in workloads where activate_page()
> is frequently used. We could batch activate_page() to reduce the lock
> contention. The batched pages will be added to the zone list when the pool
> is full or when page reclaim is trying to drain them.
>
> For example, on a 4-socket, 64-CPU system, create a sparse file and 64
> processes that share a mapping of the file. Each process read-accesses the
> whole file and then exits. The process exit will do unmap_vmas() and cause
> a lot of activate_page() calls. In such a workload, we saw about 58% total
> time reduction with the patch below. Other workloads with a lot of
> activate_page() also benefit.
>
> ...
> The most significant are:
> case-lru-file-readtwice -11.69%
> case-mmap-pread-rand -15.26%
> case-mmap-pread-seq -69.72%
>
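The batching scheme that commit describes can be modeled in a few lines (a toy userspace sketch, not the mm/swap.c code; all names here are made up): activations are staged in a per-CPU pool and the shared lock is only taken once per full pool.

```c
#include <assert.h>

/* Toy model of the batching in the quoted commit: activations are staged
 * in a per-CPU pool and the shared lru_lock is taken once per full pool
 * instead of once per page.  Names are illustrative; the real code lives
 * in mm/swap.c. */

#define POOL_SIZE 15                  /* PAGEVEC_SIZE at the time */

struct activate_pool {
	int nr;                       /* pages currently staged */
	int lock_acquisitions;        /* shared-lock round-trips */
};

/* Stage one page; flush (one lock round-trip) when the pool fills. */
static void activate_page_batched(struct activate_pool *p)
{
	p->nr++;
	if (p->nr == POOL_SIZE) {
		p->lock_acquisitions++;
		p->nr = 0;
	}
}

/* Unbatched baseline: one lock round-trip per page. */
static int locks_unbatched(int pages)
{
	return pages;
}

static int locks_batched(int pages)
{
	struct activate_pool p = { 0, 0 };

	for (int i = 0; i < pages; i++)
		activate_page_batched(&p);
	if (p.nr)                     /* drain leftovers, as reclaim would */
		p.lock_acquisitions++;
	return p.lock_acquisitions;
}
```

For 150 activations this takes the lock 10 times instead of 150, which is the effect behind the numbers quoted above.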
> Some Gemini answers (question was "list of nohz_full usecases"):
>
> 2. Scientific Simulation & Research
>
> Research institutions (like CERN, NASA, or national labs) use nohz_full
> for "tightly coupled" parallel workloads.
>
> Workloads: Molecular dynamics, fluid dynamics (CFD), and weather forecasting (e.g., WRF models).
>
> The "Barrier" Problem: In massive clusters using MPI (Message Passing
> Interface), all CPUs often have to reach a synchronization barrier
> before the next step of a simulation. If one CPU is delayed by a few
> milliseconds due to a timer tick, all other thousands of CPUs sit idle
> waiting for it. nohz_full prevents this "tail latency" from stalling the
> entire supercomputer.
Wow! I heard about that possible use case but I didn't know it was
really used in practice. I privately dug into this with Gemini and
there are actual use case references.
>
> ...
>
> 4. Competitive Benchmarking & Kernel Development
> Performance engineers use this mode to get "clean" numbers when testing
> new hardware or compilers.
>
> Workloads: Core-to-core latency tests, cache-bandwidth benchmarks, and
> standard suites like SPEC CPU.
>
> Goal: Eliminating the "noise" of the operating system so that the
> results reflect pure hardware performance.
Ok didn't know about that one either.
> Summary table: who uses nohz_full?
>
> User group        Primary workload         Why they use it
> Quant firms       High-frequency trading   To prevent micro-stutter during trade execution.
> Research labs     MPI-based simulations    To avoid the "slowest node" stalling the whole cluster.
> Telcos/ISPs       5G/packet processing     To ensure wire-speed processing without interrupts.
> Hardware vendors  Chip validation          To benchmark CPU performance without OS interference.
>
>
> Here is how scientific simulations handle system calls:
>
> 1. The "Compute-Loop" (Low Syscall)
>
> The core of a simulation (like a GROMACS molecular dynamics step) is
> just raw math: fetching data from RAM, doing floating-point arithmetic
> (AVX/SSE), and writing it back.
>
> During the loop: The CPU stays in "Userspace" for millions of cycles
> without ever asking the kernel for help.
>
> Why it works: Since there are no system calls, nohz_full can
> successfully turn off the timer tick, allowing the CPU to focus 100% on
> the math.
>
> 2. The "Communication-Phase" (High Syscall)
>
> System calls usually happen only at the end of a computation block, when
> the simulation needs to talk to other nodes.
>
> The Tools: MPI (Message Passing Interface) uses system calls like write,
> sendmsg, or specialized RDMA calls to move data across the network.
>
> The Pattern: These simulations follow a "Burst" pattern—long periods
> of zero system calls (computation) followed by a quick burst of system
> calls (synchronization).
>
> 3. When are they "Syscall Intensive"?
>
> There are specific parts of a simulation that are intensive, but
> researchers try to minimize them:
>
> I/O Operations: Writing "checkpoints" or large trajectory files to disk
> (using write()). This is why high-end HPC systems use Asynchronous I/O
> or dedicated I/O nodes—to keep the compute cores from getting bogged
> down in system calls.
>
> Memory Allocation: Constantly calling malloc/free involves the brk or
> mmap system calls. Optimized simulation tools pre-allocate all the
> memory they need at startup to avoid this.
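That pre-allocation pattern is essentially a bump (arena) allocator: one large allocation at startup, then pure pointer arithmetic in the compute loop, so no brk/mmap ever happens while the CPU is supposed to stay undisturbed. A minimal sketch (illustrative names, not any real HPC allocator):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One big allocation up front (the only point where malloc may enter the
 * kernel via brk/mmap); afterwards arena_alloc() is pure arithmetic. */
struct arena {
	char  *base;
	size_t size;
	size_t used;
};

static int arena_init(struct arena *a, size_t size)
{
	a->base = malloc(size);
	a->size = size;
	a->used = 0;
	return a->base != NULL;
}

/* Bump allocation: no syscalls, 16-byte aligned, NULL when exhausted. */
static void *arena_alloc(struct arena *a, size_t n)
{
	n = (n + 15) & ~(size_t)15;
	if (a->used + n > a->size)
		return NULL;
	void *p = a->base + a->used;
	a->used += n;
	return p;
}

/* Small demo: two allocations round up to 16 bytes each. */
static size_t arena_demo(void)
{
	struct arena a;
	size_t used = 0;

	if (arena_init(&a, 1024)) {
		arena_alloc(&a, 10);
		arena_alloc(&a, 1);
		used = a.used;
		free(a.base);
	}
	return used;
}
```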
Ok. I asked a similar question and got this (you made me use an LLM for the
first time, btw; I held out for 4 years... I'm sure I can wait 4 more years
until the next use :o)
Does MPI synchronization only use userspace code or does it rely on syscalls?
The short answer is: **both**, but the goal of HPC (High Performance Computing) is to use as few **syscalls** (system calls) as possible.
Here is the technical breakdown:
### 1. The "Fast Path" (User-Space only)
To maximize speed, modern MPI implementations (like OpenMPI or MPICH) try to stay in **User-Space** as much as possible.
* **Shared Memory (Intra-node):** If two processes are running on the same server, they communicate via a shared memory zone. Synchronization is handled by **spinlocks** or **software barriers**. The CPU "polls" (loops) on a memory variable until it changes. **No syscall is involved here.**
* **RDMA / InfiniBand (Inter-node):** With high-performance network cards, MPI uses **RDMA** (Remote Direct Memory Access). The user code instructs the network card to write directly into the memory of the distant server. Once configured, data exchange happens without involving the Linux kernel.
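The shared-memory fast path above really is just memory polling; a counting barrier can be written with nothing but atomics (an illustrative sketch with no sense reversal, so it is single-use; not OpenMPI's implementation):

```c
#include <assert.h>
#include <stdatomic.h>

/* Each rank atomically announces its arrival, then spins on the counter
 * until everyone has arrived.  No futex(), no poll(): the kernel is never
 * entered, which is exactly why nohz_full pays off for this pattern. */
struct spin_barrier {
	atomic_int arrived;
	int nranks;
};

static void barrier_init(struct spin_barrier *b, int nranks)
{
	atomic_init(&b->arrived, 0);
	b->nranks = nranks;
}

/* Returns only once all ranks have called it. */
static void barrier_wait(struct spin_barrier *b)
{
	atomic_fetch_add(&b->arrived, 1);
	while (atomic_load(&b->arrived) < b->nranks)
		;	/* pure memory polling */
}

/* Trivial single-rank demo (real use runs one thread per rank). */
static int barrier_demo(void)
{
	struct spin_barrier b;

	barrier_init(&b, 1);
	barrier_wait(&b);
	return atomic_load(&b.arrived);
}
```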
### 2. The "Slow Path" (System Calls / Syscalls)
Passing through the kernel (a syscall) is necessary in certain situations, but it is "expensive" because it forces a switch into the kernel, which costs cycles and disturbs CPU caches.
* **Initialization:** During startup (`MPI_Init`), many syscalls are used to create sockets, map shared memory (`mmap`), and configure network interfaces.
* **Standard TCP/IP:** If you are not using a high-performance network (RDMA) but simple Ethernet instead, MPI must call `send()` and `recv()`, which are syscalls. The Linux kernel then takes over to manage the TCP/IP stack.
* **Sleep Mode (Blocking):** If an MPI process waits for a message for too long, it may decide to "go to sleep" to yield the CPU to another task via syscalls like `futex()` or `poll()`.
**In summary:** MPI synchronization aims to be **100% User-Space** (via memory polling) to avoid syscall latency. It is precisely because MPI tries to bypass the kernel that we use `nohz_full`: we are asking the kernel not to even "knock on the CPU's door" with its timer tick.