From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 25 Feb 2026 22:49:54 +0100
From: Frederic Weisbecker <frederic@kernel.org>
To: Marcelo Tosatti
Cc: Michal Hocko, Leonardo Bras, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Thomas Gleixner, Waiman Long, Boqun Feng
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
References: <20260206143430.021026873@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Tue, Feb 24, 2026 at 02:23:25PM -0300, Marcelo Tosatti wrote:
> On Mon, Feb 23, 2026 at 10:56:15PM +0100, Frederic Weisbecker wrote:
> > On Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko wrote:
> > > On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > > > Michal,
> > > >
> > > > Again, I don't see how moving operations to happen at return to
> > > > the kernel would help (assuming you are talking about
> > > > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> > >
> > > Nope, I am not talking about IPIs, although those are an example of pcp
> > > state as well. I am sorry I do not have a link handy; I am pretty sure
> > > Frederic will have that. Another example, though, was vmstat flushes
> > > that need to be pcp. There are many other examples.
> >
> > Here it is:
> >
> > https://lore.kernel.org/all/20250410152327.24504-1-frederic@kernel.org/
>
> Thanks.
>
> Frederic,
>
> I think this is a valid solution. However, on systems with many CPUs in
> nohz_full performing system calls, can't there be a significant increase
> in lru_lock contention? Consider 100+ CPUs performing many system calls
> which add 1 or 2 folios to per-CPU LRU lists.

That's more a question for Michal or Vlastimil.

> Note: if you are confident the above is not a problem,
> this approach looks good to me.
>
> commit eb709b0d062efd653a61183af8e27b2711c3cf5c
> Author: Shaohua Li
> Date: Tue May 24 17:12:55 2011 -0700
>
> mm: batch activate_page() to reduce lock contention
>
> The zone->lru_lock is heavily contended in workloads where activate_page()
> is frequently used. We could batch activate_page() to reduce the lock
> contention. The batched pages will be added into the zone list when the
> pool is full or page reclaim is trying to drain them.
>
> For example, in a 4 socket 64 CPU system, create a sparse file and 64
> processes that shared-map the file. Each process read-accesses the whole
> file and then exits. The process exit will do unmap_vmas() and cause a
> lot of activate_page() calls. In such a workload, we saw about 58% total
> time reduction with the patch below. Other workloads with a lot of
> activate_page() calls benefit a lot too.
>
> ...
> The most significant are:
> case-lru-file-readtwice  -11.69%
> case-mmap-pread-rand     -15.26%
> case-mmap-pread-seq      -69.72%
>
> Some Gemini answers (question was "list of nohz_full usecases"):
>
> 2. Scientific Simulation & Research
>
> Research institutions (like CERN, NASA, or national labs) use nohz_full
> for "tightly coupled" parallel workloads.
>
> Workloads: Molecular dynamics, fluid dynamics (CFD), and weather
> forecasting (e.g., WRF models).
>
> The "Barrier" Problem: In massive clusters using MPI (Message Passing
> Interface), all CPUs often have to reach a synchronization barrier
> before the next step of a simulation. If one CPU is delayed by a few
> milliseconds due to a timer tick, thousands of other CPUs sit idle
> waiting for it. nohz_full prevents this "tail latency" from stalling
> the entire supercomputer.

Wow! I had heard about that possible use case but I didn't know it was
really used in practice. I privately dug into this with Gemini and there
are actual use case references.

> ...
>
> 4. Competitive Benchmarking & Kernel Development
>
> Performance engineers use this mode to get "clean" numbers when testing
> new hardware or compilers.
>
> Workloads: Core-to-core latency tests, cache-bandwidth benchmarks, and
> standard suites like SPEC CPU.
>
> Goal: Eliminating the "noise" of the operating system so that the
> results reflect pure hardware performance.

Ok, didn't know about that one either.

> Summary Table: Who uses nohz_full?
>
> User Group       | Primary Workload        | Why they use it
> Quant Firms      | High-Frequency Trading  | To prevent micro-stutter during trade execution.
> Research Labs    | MPI-based Simulations   | To avoid the "slowest node" stalling the whole cluster.
> Telcos/ISPs      | 5G/Packet Processing    | To ensure wire-speed processing without interrupts.
> Hardware Vendors | Chip Validation         | To benchmark CPU performance without OS interference.
>
> Here is how scientific simulations handle system calls:
>
> 1. The "Compute-Loop" (Low Syscall)
>
> The core of a simulation (like a GROMACS molecular dynamics step) is
> just raw math: fetching data from RAM, doing floating-point arithmetic
> (AVX/SSE), and writing it back.
>
> During the loop: the CPU stays in userspace for millions of cycles
> without ever asking the kernel for help.
>
> Why it works: since there are no system calls, nohz_full can
> successfully turn off the timer tick, allowing the CPU to focus 100% on
> the math.
>
> 2. The "Communication-Phase" (High Syscall)
>
> System calls usually happen only at the end of a computation block, when
> the simulation needs to talk to other nodes.
>
> The Tools: MPI (Message Passing Interface) uses system calls like write,
> sendmsg, or specialized RDMA calls to move data across the network.
>
> The Pattern: these simulations follow a "burst" pattern: long periods
> of zero system calls (computation) followed by a quick burst of system
> calls (synchronization).
>
> 3. When are they "Syscall Intensive"?
>
> There are specific parts of a simulation that are intensive, but
> researchers try to minimize them:
>
> I/O Operations: Writing "checkpoints" or large trajectory files to disk
> (using write()). This is why high-end HPC systems use asynchronous I/O
> or dedicated I/O nodes: to keep the compute cores from getting bogged
> down in system calls.
>
> Memory Allocation: Constantly calling malloc/free involves the brk or
> mmap system calls. Optimized simulation tools pre-allocate all the
> memory they need at startup to avoid this.

Ok. I asked a similar question and got this (you made me use an LLM for
the first time, btw; I held out for 4 years... I'm sure I can wait 4 more
years until the next usage :o)

Does MPI synchronization only use userspace code or does it rely on
syscalls?

The short answer is: **both**, but the goal of HPC (High Performance
Computing) is to use as few **syscalls** (system calls) as possible.

Here is the technical breakdown:

### 1. The "Fast Path" (User-Space only)

To maximize speed, modern MPI implementations (like OpenMPI or MPICH)
try to stay in **User-Space** as much as possible.

* **Shared Memory (Intra-node):** If two processes are running on the
same server, they communicate via a shared memory zone. Synchronization
is handled by **spinlocks** or **software barriers**. The CPU "polls"
(loops) on a memory variable until it changes. **No syscall is involved
here.**

* **RDMA / InfiniBand (Inter-node):** With high-performance network
cards, MPI uses **RDMA** (Remote Direct Memory Access). The user code
instructs the network card to write directly into the memory of the
remote server. Once configured, data exchange happens without involving
the Linux kernel.

### 2. The "Slow Path" (System Calls / Syscalls)

Passing through the kernel (a syscall) is necessary in certain
situations, but it is "expensive": it forces a switch into kernel mode
and disturbs CPU caches.
* **Initialization:** During startup (`MPI_Init`), many syscalls are
used to create sockets, map shared memory (`mmap`), and configure
network interfaces.

* **Standard TCP/IP:** If you are not using a high-performance network
(RDMA) but plain Ethernet instead, MPI must call `send()` and `recv()`,
which are syscalls. The Linux kernel then takes over to manage the
TCP/IP stack.

* **Sleep Mode (Blocking):** If an MPI process waits for a message for
too long, it may decide to "go to sleep" and yield the CPU to another
task, via syscalls like `futex()` or `poll()`.

**In summary:** MPI synchronization aims to be **100% User-Space** (via
memory polling) to avoid syscall latency. It is precisely because MPI
tries to bypass the kernel that we use `nohz_full`: we are asking the
kernel not to even "knock on the CPU's door" with its timer interrupts.