From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 25 Feb 2026 22:49:54 +0100
From: Frederic Weisbecker <frederic@kernel.org>
To: Marcelo Tosatti
Cc: Michal Hocko, Leonardo Bras, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Thomas Gleixner, Waiman Long, Boqun Feng
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
References: <20260206143430.021026873@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Tue, Feb 24, 2026 at 02:23:25PM -0300, Marcelo Tosatti wrote:
> On Mon, Feb 23, 2026 at 10:56:15PM +0100, Frederic Weisbecker wrote:
> > On Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko wrote:
> > > On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > > > Michal,
> > > >
> > > > Again, I don't see how moving operations to happen at return to
> > > > the kernel would help (assuming you are talking about
> > > > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> > >
> > > Nope, I am not talking about IPIs, although those are an example of pcp
> > > state as well. I am sorry I do not have a link handy; I am pretty sure
> > > Frederic will have that. Another example, though, was vmstat flushes
> > > that need to be pcp. There are many other examples.
> >
> > Here it is:
> >
> > https://lore.kernel.org/all/20250410152327.24504-1-frederic@kernel.org/
>
> Thanks.
>
> Frederic,
>
> I think this is a valid solution. However, on systems with many CPUs in
> nohz_full performing system calls, can't there be a significant increase
> in lru_lock contention? Consider 100+ CPUs performing many system calls
> which add 1 or 2 folios to per-CPU LRU lists.

That's more a question for Michal or Vlastimil.

> Note: if you are confident the above is not a problem,
> this approach looks good to me.
>
> commit eb709b0d062efd653a61183af8e27b2711c3cf5c
> Author: Shaohua Li
> Date: Tue May 24 17:12:55 2011 -0700
>
> mm: batch activate_page() to reduce lock contention
>
> The zone->lru_lock is heavily contended in workloads where activate_page()
> is frequently used. We could batch activate_page() to reduce the lock
> contention. The batched pages will be added into the zone list when the
> pool is full or page reclaim is trying to drain them.
>
> For example, in a 4 socket 64 CPU system, create a sparse file and 64
> processes that shared-map the file. Each process read-accesses the whole
> file and then exits. The process exit will do unmap_vmas() and cause a
> lot of activate_page() calls. In such a workload, we saw about 58% total
> time reduction with the patch below. Other workloads with a lot of
> activate_page() calls benefit a lot too.
>
> ...
> The most significant are:
> case-lru-file-readtwice  -11.69%
> case-mmap-pread-rand     -15.26%
> case-mmap-pread-seq      -69.72%
>
> Some Gemini answers (question was "list of nohz_full usecases"):
>
> 2. Scientific Simulation & Research
>
> Research institutions (like CERN, NASA, or national labs) use nohz_full
> for "tightly coupled" parallel workloads.
>
> Workloads: Molecular dynamics, fluid dynamics (CFD), and weather
> forecasting (e.g., WRF models).
>
> The "Barrier" Problem: In massive clusters using MPI (Message Passing
> Interface), all CPUs often have to reach a synchronization barrier
> before the next step of a simulation. If one CPU is delayed by a few
> milliseconds due to a timer tick, thousands of other CPUs sit idle
> waiting for it. nohz_full prevents this "tail latency" from stalling
> the entire supercomputer.

Wow! I had heard about that possible use case but I didn't know it was
really used in practice. I privately dug into this with Gemini and there
are actual use case references.

> ...
>
> 4. Competitive Benchmarking & Kernel Development
>
> Performance engineers use this mode to get "clean" numbers when testing
> new hardware or compilers.
>
> Workloads: Core-to-core latency tests, cache-bandwidth benchmarks, and
> standard suites like SPEC CPU.
>
> Goal: Eliminating the "noise" of the operating system so that the
> results reflect pure hardware performance.

Ok, didn't know about that one either.

> Summary Table: Who uses nohz_full?
>
> User Group       | Primary Workload        | Why they use it
> Quant Firms      | High-Frequency Trading  | To prevent micro-stutter during trade execution.
> Research Labs    | MPI-based Simulations   | To avoid the "slowest node" stalling the whole cluster.
> Telcos/ISPs      | 5G/Packet Processing    | To ensure wire-speed processing without interrupts.
> Hardware Vendors | Chip Validation         | To benchmark CPU performance without OS interference.
>
> Here is how scientific simulations handle system calls:
>
> 1. The "Compute-Loop" (Low Syscall)
>
> The core of a simulation (like a GROMACS molecular dynamics step) is
> just raw math: fetching data from RAM, doing floating-point arithmetic
> (AVX/SSE), and writing it back.
>
> During the loop: the CPU stays in userspace for millions of cycles
> without ever asking the kernel for help.
>
> Why it works: since there are no system calls, nohz_full can
> successfully turn off the timer tick, allowing the CPU to focus 100% on
> the math.
>
> 2. The "Communication-Phase" (High Syscall)
>
> System calls usually happen only at the end of a computation block, when
> the simulation needs to talk to other nodes.
>
> The Tools: MPI (Message Passing Interface) uses system calls like write,
> sendmsg, or specialized RDMA calls to move data across the network.
>
> The Pattern: these simulations follow a "burst" pattern: long periods
> of zero system calls (computation) followed by a quick burst of system
> calls (synchronization).
>
> 3. When are they "Syscall Intensive"?
>
> There are specific parts of a simulation that are intensive, but
> researchers try to minimize them:
>
> I/O Operations: Writing "checkpoints" or large trajectory files to disk
> (using write()). This is why high-end HPC systems use asynchronous I/O
> or dedicated I/O nodes: to keep the compute cores from getting bogged
> down in system calls.
>
> Memory Allocation: Constantly calling malloc/free involves the brk or
> mmap system calls. Optimized simulation tools pre-allocate all the
> memory they need at startup to avoid this.

Ok. I asked a similar question and got this (you made me use an LLM for
the first time, btw; I held out for 4 years... I'm sure I can wait 4 more
years until the next usage :o)

Does MPI synchronization only use userspace code or does it rely on
syscalls?

The short answer is: **both**, but the goal of HPC (High Performance
Computing) is to use as few **syscalls** (system calls) as possible.

Here is the technical breakdown:

### 1. The "Fast Path" (User-Space only)

To maximize speed, modern MPI implementations (like OpenMPI or MPICH)
try to stay in **User-Space** as much as possible.

* **Shared Memory (Intra-node):** If two processes are running on the
same server, they communicate via a shared memory zone. Synchronization
is handled by **spinlocks** or **software barriers**. The CPU "polls"
(loops) on a memory variable until it changes. **No syscall is involved
here.**

* **RDMA / InfiniBand (Inter-node):** With high-performance network
cards, MPI uses **RDMA** (Remote Direct Memory Access). The user code
instructs the network card to write directly into the memory of the
remote server. Once configured, data exchange happens without involving
the Linux kernel.

### 2. The "Slow Path" (System Calls / Syscalls)

Passing through the kernel (a syscall) is necessary in certain
situations, but it is "expensive": it forces a switch into kernel mode
and disturbs CPU caches.
* **Initialization:** During startup (`MPI_Init`), many syscalls are
used to create sockets, map shared memory (`mmap`), and configure
network interfaces.

* **Standard TCP/IP:** If you are not using a high-performance network
(RDMA) but plain Ethernet instead, MPI must call `send()` and `recv()`,
which are syscalls. The Linux kernel then takes over to manage the
TCP/IP stack.

* **Sleep Mode (Blocking):** If an MPI process waits for a message for
too long, it may decide to "go to sleep" and yield the CPU to another
task, via syscalls like `futex()` or `poll()`.

**In summary:** MPI synchronization aims to be **100% User-Space** (via
memory polling) to avoid syscall latency. It is precisely because MPI
tries to bypass the kernel that we use `nohz_full`: we are asking the
kernel not to even "knock on the CPU's door" with its timer interrupts.