[PATCH 0/4] Introduce QPW for per-cpu operations

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] Introduce QPW for per-cpu operations
@ 2026-02-06 14:34 Marcelo Tosatti
  2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-06 14:34 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng

The problem:
Some places in the kernel implement a parallel programming strategy
consisting on local_locks() for most of the work, and some rare remote
operations are scheduled on target cpu. This keeps cache bouncing low since
cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.

The idea:
Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. That major cost, which is un/locking in every local function,
already happens in PREEMPT_RT.

Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().

This will avoid schedule_work_on(), and thus avoid scheduling-out an
RT workload.

Proposed solution:
A new interface called Queue PerCPU Work (QPW), which should replace
Work Queue in the above mentioned use case.

If PREEMPT_RT=n this interfaces just wraps the current
local_locks + WorkQueue behavior, so no expected change in runtime.

If PREEMPT_RT=y, or CONFIG_QPW=y, queue_percpu_work_on(cpu,...) will
lock that cpu's per-cpu structure and perform work on it locally. 
This is possible because on functions that can be used for performing
remote work on remote per-cpu structures, the local_lock (which is already
a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
is able to get the per_cpu spinlock() for the cpu passed as parameter.

RFC->v1:

- Introduce CONFIG_QPW and qpw= kernel boot option to enable 
  remote spinlocking and execution even on !CONFIG_PREEMPT_RT
  kernels (Leonardo Bras).
- Move buffer_head draining to separate workqueue (Marcelo Tosatti).
- Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
- Drop memcontrol convertion (as isolated CPUs are not targets
  of queue_work_on anymore).
- Rebase SLUB against Vlastimil's slab/next.
- Add basic document for QPW (Waiman Long).

The following testcase triggers lru_add_drain_all on an isolated CPU
(that does sys_write to a file before entering its realtime 
loop).

/* 
 * Simulates a low latency loop program that is interrupted
 * due to lru_add_drain_all. To trigger lru_add_drain_all, run:
 *
 * blockdev --flushbufs /dev/sdX
 *
 */ 
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <stdarg.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

int cpu;

static void *run(void *arg)
{
	pthread_t current_thread;
	cpu_set_t cpuset;
	int ret, nrloops;
	struct sched_param sched_p;
	pid_t pid;
	int fd;
	char buf[] = "xxxxxxxxxxx";

	CPU_ZERO(&cpuset);
	CPU_SET(cpu, &cpuset);

	current_thread = pthread_self();    
	ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
	if (ret) {
		perror("pthread_setaffinity_np failed\n");
		exit(0);
	}

	memset(&sched_p, 0, sizeof(struct sched_param));
	sched_p.sched_priority = 1;
	pid = gettid();
	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
	if (ret) {
		perror("sched_setscheduler");
		exit(0);
	}

	fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC);
	if (fd == -1) {
		perror("open");
		exit(0);
	}

	ret = write(fd, buf, sizeof(buf));
	if (ret == -1) {
		perror("write");
		exit(0);
	}

	do { 
		nrloops = nrloops+2;
		nrloops--;
	} while (1);
}

int main(int argc, char *argv[])
{
        int fd, ret;
	pthread_t thread;
	long val;
	char *endptr, *str;
	struct sched_param sched_p;
	pid_t pid;

	if (argc != 2) {
		printf("usage: %s cpu-nr\n", argv[0]);
		printf("where CPU number is the CPU to pin thread to\n");
		exit(0);
	}
	str = argv[1];
	cpu = strtol(str, &endptr, 10);
	if (cpu < 0) {
		printf("strtol returns %d\n", cpu);
		exit(0);
	}
	printf("cpunr=%d\n", cpu);

	memset(&sched_p, 0, sizeof(struct sched_param));
	sched_p.sched_priority = 1;
	pid = getpid();
	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
	if (ret) {
		perror("sched_setscheduler");
		exit(0);
	}

	pthread_create(&thread, NULL, run, NULL);

	sleep(5000);

	pthread_join(thread, NULL);
}

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
@ 2026-02-06 14:34 ` Marcelo Tosatti
  2026-02-06 15:20   ` Marcelo Tosatti
  2026-02-07  0:16   ` Leonardo Bras
  2026-02-06 14:34 ` [PATCH 2/4] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-06 14:34 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Marcelo Tosatti

Some places in the kernel implement a parallel programming strategy
consisting on local_locks() for most of the work, and some rare remote
operations are scheduled on target cpu. This keeps cache bouncing low since
cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem:
scheduling work on remote cpu that are executing low latency tasks
is undesired and can introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock(). We can make use of those to avoid scheduling work on a remote
cpu by directly updating another cpu's per_cpu structure, while holding
it's spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
and also the corresponding queue_percpu_work_on() and flush_percpu_work()
helpers to run the remote work.

Users of non-RT kernels but with low latency requirements can select
similar functionality by using the CONFIG_QPW compile time option.

On CONFIG_QPW disabled kernels, no changes are expected, as every
one of the introduced helpers work the exactly same as the current
implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For QPW enabled kernels, though, qpw_{un,}lock*() will use the extra
cpu parameter to select the correct per-cpu structure to work on,
and acquire the spinlock for that cpu.

queue_percpu_work_on() will just call the requested function in the current
cpu, which will operate in another cpu's per-cpu object. Since the
local_locks() become spinlock()s in QPW enabled kernels, we are
safe doing that.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.

Some minimal code rework is needed in order to make this mechanism work:
The calls for local_{un,}lock*() on the functions that are currently
scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
QPW enabled kernels they can reference a different cpu. It's also
necessary to use a qpw_struct instead of a work_struct, but it just
contains a work struct and, in CONFIG_QPW, the target cpu.

This should have almost no impact on non-CONFIG_QPW kernels: few
this_cpu_ptr() will become per_cpu_ptr(,smp_processor_id()).

On CONFIG_QPW kernels, this should avoid deadlines misses by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   10 +
 Documentation/locking/qpwlocks.rst              |   63 +++++++
 MAINTAINERS                                     |    6 
 include/linux/qpw.h                             |  190 ++++++++++++++++++++++++
 init/Kconfig                                    |   35 ++++
 kernel/Makefile                                 |    2 
 kernel/qpw.c                                    |   26 +++
 7 files changed, 332 insertions(+)
 create mode 100644 include/linux/qpw.h
 create mode 100644 kernel/qpw.c

Index: slab/Documentation/admin-guide/kernel-parameters.txt
===================================================================
--- slab.orig/Documentation/admin-guide/kernel-parameters.txt
+++ slab/Documentation/admin-guide/kernel-parameters.txt
@@ -2819,6 +2819,16 @@ Kernel parameters
 
 			The format of <cpu-list> is described above.
 
+	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
+			and remote interference mechanism on a kernel built with
+			CONFIG_QPW.
+			Format: { "0" | "1" }
+			0 - local_lock() + queue_work_on(remote_cpu)
+			1 - spin_lock() for both local and remote operations
+
+			Selecting 1 may be interesting for systems that want
+			to avoid interruption & context switches from IPIs.
+
 	iucv=		[HW,NET]
 
 	ivrs_ioapic	[HW,X86-64]
Index: slab/MAINTAINERS
===================================================================
--- slab.orig/MAINTAINERS
+++ slab/MAINTAINERS
@@ -21291,6 +21291,12 @@ F:	Documentation/networking/device_drive
 F:	drivers/bus/fsl-mc/
 F:	include/uapi/linux/fsl_mc.h
 
+QPW
+M:	Leonardo Bras <leobras@redhat.com>
+S:	Supported
+F:	include/linux/qpw.h
+F:	kernel/qpw.c
+
 QT1010 MEDIA DRIVER
 L:	linux-media@vger.kernel.org
 S:	Orphan
Index: slab/include/linux/qpw.h
===================================================================
--- /dev/null
+++ slab/include/linux/qpw.h
@@ -0,0 +1,190 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H
+
+#include "linux/spinlock.h"
+#include "linux/local_lock.h"
+#include "linux/workqueue.h"
+
+#ifndef CONFIG_QPW
+
+typedef local_lock_t qpw_lock_t;
+typedef local_trylock_t qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+};
+
+#define qpw_lock_init(lock)			\
+	local_lock_init(lock)
+
+#define qpw_trylock_init(lock)			\
+	local_trylock_init(lock)
+
+#define qpw_lock(lock, cpu)			\
+	local_lock(lock)
+
+#define qpw_lock_irqsave(lock, flags, cpu)	\
+	local_lock_irqsave(lock, flags)
+
+#define qpw_trylock(lock, cpu)			\
+	local_trylock(lock)
+
+#define qpw_trylock_irqsave(lock, flags, cpu)	\
+	local_trylock_irqsave(lock, flags)
+
+#define qpw_unlock(lock, cpu)			\
+	local_unlock(lock)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)	\
+	local_unlock_irqrestore(lock, flags)
+
+#define qpw_lockdep_assert_held(lock)		\
+	lockdep_assert_held(lock)
+
+#define queue_percpu_work_on(c, wq, qpw)	\
+	queue_work_on(c, wq, &(qpw)->work)
+
+#define flush_percpu_work(qpw)			\
+	flush_work(&(qpw)->work)
+
+#define qpw_get_cpu(qpw)	smp_processor_id()
+
+#define qpw_is_cpu_remote(cpu)		(false)
+
+#define INIT_QPW(qpw, func, c)			\
+	INIT_WORK(&(qpw)->work, (func))
+
+#else /* CONFIG_QPW */
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+
+typedef union {
+	spinlock_t sl;
+	local_lock_t ll;
+} qpw_lock_t;
+
+typedef union {
+	spinlock_t sl;
+	local_trylock_t ll;
+} qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#define qpw_lock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_lock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_trylock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_trylock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_lock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
+		else									\
+			local_lock(lock.ll);						\
+	} while (0)
+
+#define qpw_lock_irqsave(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);		\
+		else									\
+			local_lock_irqsave(lock.ll, flags);				\
+	} while (0)
+
+#define qpw_trylock(lock, cpu)								\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
+		else									\
+			t = local_trylock(lock.ll);					\
+		t;									\
+	})
+
+#define qpw_trylock_irqsave(lock, flags, cpu)						\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			t = local_trylock_irqsave(lock.ll, flags);			\
+		t;									\
+	})
+
+#define qpw_unlock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock(per_cpu_ptr(lock.sl, cpu));				\
+		} else {								\
+			local_unlock(lock.ll);						\
+		}									\
+	} while (0)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			local_unlock_irqrestore(lock.ll, flags);			\
+	} while (0)
+
+#define qpw_lockdep_assert_held(lock)							\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			lockdep_assert_held(this_cpu_ptr(lock.sl));			\
+		else									\
+			lockdep_assert_held(this_cpu_ptr(lock.ll));			\
+	} while (0)
+
+#define queue_percpu_work_on(c, wq, qpw)						\
+	do {										\
+		int __c = c;								\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			WARN_ON((__c) != __qpw->cpu);					\
+			__qpw->work.func(&__qpw->work);					\
+		} else {								\
+			queue_work_on(__c, wq, &(__qpw)->work);				\
+		}									\
+	} while (0)
+
+/*
+ * Does nothing if QPW is set to use spinlock, as the task is already done at the
+ * time queue_percpu_work_on() returns.
+ */
+#define flush_percpu_work(qpw)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {		\
+			flush_work(&__qpw->work);					\
+		}									\
+	} while (0)
+
+#define qpw_get_cpu(w)			container_of((w), struct qpw_struct, work)->cpu
+
+#define qpw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
+
+#define INIT_QPW(qpw, func, c)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		INIT_WORK(&__qpw->work, (func));					\
+		__qpw->cpu = (c);							\
+	} while (0)
+
+#endif /* CONFIG_QPW */
+#endif /* LINUX_QPW_H */
Index: slab/init/Kconfig
===================================================================
--- slab.orig/init/Kconfig
+++ slab/init/Kconfig
@@ -747,6 +747,41 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config QPW
+	bool "Queue per-CPU Work"
+	depends on SMP || COMPILE_TEST
+	default n
+	help
+	  Allow changing the behavior on per-CPU resource sharing with cache,
+	  from the regular local_locks() + queue_work_on(remote_cpu) to using
+	  per-CPU spinlocks on both local and remote operations.
+
+	  This is useful to give user the option on reducing IPIs to CPUs, and
+	  thus reduce interruptions and context switches. On the other hand, it
+	  increases generated code and will use atomic operations if spinlocks
+	  are selected.
+
+	  If set, will use the default behavior set in QPW_DEFAULT unless boot
+	  parameter qpw is passed with a different behavior.
+
+	  If unset, will use the local_lock() + queue_work_on() strategy,
+	  regardless of the boot parameter or QPW_DEFAULT.
+
+	  Say N if unsure.
+
+config QPW_DEFAULT
+	bool "Use per-CPU spinlocks by default"
+	depends on QPW
+	default n
+	help
+	  If set, will use per-CPU spinlocks as default behavior for per-CPU
+	  remote operations.
+
+	  If unset, will use local_lock() + queue_work_on(cpu) as default
+	  behavior for remote operations.
+
+	  Say N if unsure
+
 source "kernel/rcu/Kconfig"
 
 config IKCONFIG
Index: slab/kernel/Makefile
===================================================================
--- slab.orig/kernel/Makefile
+++ slab/kernel/Makefile
@@ -140,6 +140,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_QPW) += qpw.o
+
 CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
 CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
 obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
Index: slab/kernel/qpw.c
===================================================================
--- /dev/null
+++ slab/kernel/qpw.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/export.h"
+#include <linux/sched.h>
+#include <linux/qpw.h>
+#include <linux/string.h>
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+EXPORT_SYMBOL(qpw_sl);
+
+static int __init qpw_setup(char *str)
+{
+	int opt;
+
+	if (!get_option(&str, &opt)) {
+		pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str);
+		return 0;
+	}
+
+	if (opt)
+		static_branch_enable(&qpw_sl);
+	else
+		static_branch_disable(&qpw_sl);
+
+	return 0;
+}
+__setup("qpw=", qpw_setup);
Index: slab/Documentation/locking/qpwlocks.rst
===================================================================
--- /dev/null
+++ slab/Documentation/locking/qpwlocks.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+QPW locks
+=========
+
+Some places in the kernel implement a parallel programming strategy
+consisting on local_locks() for most of the work, and some rare remote
+operations are scheduled on target cpu. This keeps cache bouncing low since
+cacheline tends to be mostly local, and avoids the cost of locks in non-RT
+kernels, even though the very few remote operations will be expensive due
+to scheduling overhead.
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on remote cpu that are executing low latency tasks
+is undesired and can introduce unexpected deadline misses.
+
+QPW locks help to convert sites that use local_locks (for cpu local operations)
+and queue_work_on (for queueing work remotely, to be executed
+locally on the owner cpu of the lock) to QPW locks.
+
+The lock is declared qpw_lock_t type.
+The lock is initialized with qpw_lock_init.
+The lock is locked with qpw_lock (takes a lock and cpu as a parameter).
+The lock is unlocked with qpw_unlock (takes a lock and cpu as a parameter).
+
+The qpw_lock_irqsave function disables interrupts and saves current interrupt state,
+cpu as a parameter.
+
+For trylock variant, there is the qpw_trylock_t type, initialized with
+qpw_trylock_init. Then the corresponding qpw_trylock and
+qpw_trylock_irqsave.
+
+work_struct should be replaced by qpw_struct, which contains a cpu parameter
+(owner cpu of the lock), initialized by INIT_QPW.
+
+The queue work related functions (analogous to queue_work_on and flush_work) are:
+queue_percpu_work_on and flush_percpu_work.
+
+The behaviour of the QPW functions is as follows:
+
+* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and qpw=off kernel
+boot parameter):
+        - qpw_lock:                     local_lock
+        - qpw_lock_irqsave:             local_lock_irqsave
+        - qpw_trylock:                  local_trylock
+        - qpw_trylock_irqsave:          local_trylock_irqsave
+        - qpw_unlock:                   local_unlock
+        - queue_percpu_work_on:         queue_work_on
+        - flush_percpu_work:            flush_work
+
+* CONFIG_PREEMPT_RT or CONFIG_QPW (and CONFIG_QPW_DEFAULT or qpw=on kernel
+boot parameter),
+        - qpw_lock:                     spin_lock
+        - qpw_lock_irqsave:             spin_lock_irqsave
+        - qpw_trylock:                  spin_trylock
+        - qpw_trylock_irqsave:          spin_trylock_irqsave
+        - qpw_unlock:                   spin_unlock
+        - queue_percpu_work_on:         executes work function on caller cpu
+        - flush_percpu_work:            empty
+
+qpw_get_cpu(work_struct), to be called from within qpw work function,
+returns the target cpu.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
@ 2026-02-06 15:20   ` Marcelo Tosatti
  2026-02-07  0:16   ` Leonardo Bras
  1 sibling, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-06 15:20 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng

On Fri, Feb 06, 2026 at 11:34:31AM -0300, Marcelo Tosatti wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.

Forgot to mention: patchset is against Vlastimil's slab/next tree.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
  2026-02-06 15:20   ` Marcelo Tosatti
@ 2026-02-07  0:16   ` Leonardo Bras
  2026-02-11 12:09     ` Marcelo Tosatti
  1 sibling, 1 reply; 35+ messages in thread
From: Leonardo Bras @ 2026-02-07  0:16 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Fri, Feb 06, 2026 at 11:34:31AM -0300, Marcelo Tosatti wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem:
> scheduling work on remote cpu that are executing low latency tasks
> is undesired and can introduce unexpected deadline misses.
> 
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> it's spinlock().
> 
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> helpers to run the remote work.
> 
> Users of non-RT kernels but with low latency requirements can select
> similar functionality by using the CONFIG_QPW compile time option.
> 
> On CONFIG_QPW disabled kernels, no changes are expected, as every
> one of the introduced helpers work the exactly same as the current
> implementation:
> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> queue_percpu_work_on()  ->  queue_work_on()
> flush_percpu_work()     ->  flush_work()
> 
> For QPW enabled kernels, though, qpw_{un,}lock*() will use the extra
> cpu parameter to select the correct per-cpu structure to work on,
> and acquire the spinlock for that cpu.
> 
> queue_percpu_work_on() will just call the requested function in the current
> cpu, which will operate in another cpu's per-cpu object. Since the
> local_locks() become spinlock()s in QPW enabled kernels, we are
> safe doing that.
> 
> flush_percpu_work() then becomes a no-op since no work is actually
> scheduled on a remote cpu.
> 
> Some minimal code rework is needed in order to make this mechanism work:
> The calls for local_{un,}lock*() on the functions that are currently
> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
> QPW enabled kernels they can reference a different cpu. It's also
> necessary to use a qpw_struct instead of a work_struct, but it just
> contains a work struct and, in CONFIG_QPW, the target cpu.
> 
> This should have almost no impact on non-CONFIG_QPW kernels: few
> this_cpu_ptr() will become per_cpu_ptr(,smp_processor_id()).
> 
> On CONFIG_QPW kernels, this should avoid deadlines misses by
> removing scheduling noise.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   10 +
>  Documentation/locking/qpwlocks.rst              |   63 +++++++
>  MAINTAINERS                                     |    6 
>  include/linux/qpw.h                             |  190 ++++++++++++++++++++++++
>  init/Kconfig                                    |   35 ++++
>  kernel/Makefile                                 |    2 
>  kernel/qpw.c                                    |   26 +++
>  7 files changed, 332 insertions(+)
>  create mode 100644 include/linux/qpw.h
>  create mode 100644 kernel/qpw.c
> 
> Index: slab/Documentation/admin-guide/kernel-parameters.txt
> ===================================================================
> --- slab.orig/Documentation/admin-guide/kernel-parameters.txt
> +++ slab/Documentation/admin-guide/kernel-parameters.txt
> @@ -2819,6 +2819,16 @@ Kernel parameters
>  
>  			The format of <cpu-list> is described above.
>  
> +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> +			and remote interference mechanism on a kernel built with
> +			CONFIG_QPW.
> +			Format: { "0" | "1" }
> +			0 - local_lock() + queue_work_on(remote_cpu)
> +			1 - spin_lock() for both local and remote operations
> +
> +			Selecting 1 may be interesting for systems that want
> +			to avoid interruption & context switches from IPIs.
> +
>  	iucv=		[HW,NET]
>  
>  	ivrs_ioapic	[HW,X86-64]
> Index: slab/MAINTAINERS
> ===================================================================
> --- slab.orig/MAINTAINERS
> +++ slab/MAINTAINERS
> @@ -21291,6 +21291,12 @@ F:	Documentation/networking/device_drive
>  F:	drivers/bus/fsl-mc/
>  F:	include/uapi/linux/fsl_mc.h
>  
> +QPW
> +M:	Leonardo Bras <leobras@redhat.com>

Thanks for keeping that up :)
Could you please change this line to 

+M:	Leonardo Bras <leobras.c@gmail.com>

As I don't have access to Red Hat's mail anymore.
The signoffs on each commit should be fine to keep :)

> +S:	Supported
> +F:	include/linux/qpw.h
> +F:	kernel/qpw.c
> +

Should we also add the Documentation file as well?

+F:	Documentation/locking/qpwlocks.rst


>  QT1010 MEDIA DRIVER
>  L:	linux-media@vger.kernel.org
>  S:	Orphan
> Index: slab/include/linux/qpw.h
> ===================================================================
> --- /dev/null
> +++ slab/include/linux/qpw.h
> @@ -0,0 +1,190 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_QPW_H
> +#define _LINUX_QPW_H
> +
> +#include "linux/spinlock.h"
> +#include "linux/local_lock.h"
> +#include "linux/workqueue.h"
> +
> +#ifndef CONFIG_QPW
> +
> +typedef local_lock_t qpw_lock_t;
> +typedef local_trylock_t qpw_trylock_t;
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +};
> +
> +#define qpw_lock_init(lock)			\
> +	local_lock_init(lock)
> +
> +#define qpw_trylock_init(lock)			\
> +	local_trylock_init(lock)
> +
> +#define qpw_lock(lock, cpu)			\
> +	local_lock(lock)
> +
> +#define qpw_lock_irqsave(lock, flags, cpu)	\
> +	local_lock_irqsave(lock, flags)
> +
> +#define qpw_trylock(lock, cpu)			\
> +	local_trylock(lock)
> +
> +#define qpw_trylock_irqsave(lock, flags, cpu)	\
> +	local_trylock_irqsave(lock, flags)
> +
> +#define qpw_unlock(lock, cpu)			\
> +	local_unlock(lock)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)	\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define qpw_lockdep_assert_held(lock)		\
> +	lockdep_assert_held(lock)
> +
> +#define queue_percpu_work_on(c, wq, qpw)	\
> +	queue_work_on(c, wq, &(qpw)->work)
> +
> +#define flush_percpu_work(qpw)			\
> +	flush_work(&(qpw)->work)
> +
> +#define qpw_get_cpu(qpw)	smp_processor_id()
> +
> +#define qpw_is_cpu_remote(cpu)		(false)
> +
> +#define INIT_QPW(qpw, func, c)			\
> +	INIT_WORK(&(qpw)->work, (func))
> +
> +#else /* CONFIG_QPW */
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
> +
> +typedef union {
> +	spinlock_t sl;
> +	local_lock_t ll;
> +} qpw_lock_t;
> +
> +typedef union {
> +	spinlock_t sl;
> +	local_trylock_t ll;
> +} qpw_trylock_t;
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +	int cpu;
> +};
> +
> +#define qpw_lock_init(lock)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock_init(lock.sl);					\
> +		else									\
> +			local_lock_init(lock.ll);					\
> +	} while (0)
> +
> +#define qpw_trylock_init(lock)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock_init(lock.sl);					\
> +		else									\
> +			local_trylock_init(lock.ll);					\
> +	} while (0)
> +
> +#define qpw_lock(lock, cpu)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
> +		else									\
> +			local_lock(lock.ll);						\
> +	} while (0)
> +
> +#define qpw_lock_irqsave(lock, flags, cpu)						\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);		\
> +		else									\
> +			local_lock_irqsave(lock.ll, flags);				\
> +	} while (0)
> +
> +#define qpw_trylock(lock, cpu)								\
> +	({										\
> +		int t;									\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
> +		else									\
> +			t = local_trylock(lock.ll);					\
> +		t;									\
> +	})
> +
> +#define qpw_trylock_irqsave(lock, flags, cpu)						\
> +	({										\
> +		int t;									\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
> +		else									\
> +			t = local_trylock_irqsave(lock.ll, flags);			\
> +		t;									\
> +	})
> +
> +#define qpw_unlock(lock, cpu)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> +			spin_unlock(per_cpu_ptr(lock.sl, cpu));				\
> +		} else {								\
> +			local_unlock(lock.ll);						\
> +		}									\
> +	} while (0)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)						\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
> +		else									\
> +			local_unlock_irqrestore(lock.ll, flags);			\
> +	} while (0)
> +
> +#define qpw_lockdep_assert_held(lock)							\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			lockdep_assert_held(this_cpu_ptr(lock.sl));			\
> +		else									\
> +			lockdep_assert_held(this_cpu_ptr(lock.ll));			\
> +	} while (0)
> +
> +#define queue_percpu_work_on(c, wq, qpw)						\
> +	do {										\
> +		int __c = c;								\
> +		struct qpw_struct *__qpw = (qpw);					\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> +			WARN_ON((__c) != __qpw->cpu);					\
> +			__qpw->work.func(&__qpw->work);					\
> +		} else {								\
> +			queue_work_on(__c, wq, &(__qpw)->work);				\
> +		}									\
> +	} while (0)
> +
> +/*
> + * Does nothing if QPW is set to use spinlock, as the task is already done at the
> + * time queue_percpu_work_on() returns.
> + */
> +#define flush_percpu_work(qpw)								\
> +	do {										\
> +		struct qpw_struct *__qpw = (qpw);					\
> +		if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {		\
> +			flush_work(&__qpw->work);					\
> +		}									\
> +	} while (0)
> +
> +#define qpw_get_cpu(w)			container_of((w), struct qpw_struct, work)->cpu
> +
> +#define qpw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
> +
> +#define INIT_QPW(qpw, func, c)								\
> +	do {										\
> +		struct qpw_struct *__qpw = (qpw);					\
> +		INIT_WORK(&__qpw->work, (func));					\
> +		__qpw->cpu = (c);							\
> +	} while (0)
> +
> +#endif /* CONFIG_QPW */
> +#endif /* LINUX_QPW_H */
> Index: slab/init/Kconfig
> ===================================================================
> --- slab.orig/init/Kconfig
> +++ slab/init/Kconfig
> @@ -747,6 +747,41 @@ config CPU_ISOLATION
>  
>  	  Say Y if unsure.
>  
> +config QPW
> +	bool "Queue per-CPU Work"
> +	depends on SMP || COMPILE_TEST
> +	default n
> +	help
> +	  Allow changing the behavior on per-CPU resource sharing with cache,
> +	  from the regular local_locks() + queue_work_on(remote_cpu) to using
> +	  per-CPU spinlocks on both local and remote operations.
> +
> +	  This is useful to give user the option on reducing IPIs to CPUs, and
> +	  thus reduce interruptions and context switches. On the other hand, it
> +	  increases generated code and will use atomic operations if spinlocks
> +	  are selected.
> +
> +	  If set, will use the default behavior set in QPW_DEFAULT unless boot
> +	  parameter qpw is passed with a different behavior.
> +
> +	  If unset, will use the local_lock() + queue_work_on() strategy,
> +	  regardless of the boot parameter or QPW_DEFAULT.
> +
> +	  Say N if unsure.
> +
> +config QPW_DEFAULT
> +	bool "Use per-CPU spinlocks by default"
> +	depends on QPW
> +	default n
> +	help
> +	  If set, will use per-CPU spinlocks as default behavior for per-CPU
> +	  remote operations.
> +
> +	  If unset, will use local_lock() + queue_work_on(cpu) as default
> +	  behavior for remote operations.
> +
> +	  Say N if unsure
> +
>  source "kernel/rcu/Kconfig"
>  
>  config IKCONFIG
> Index: slab/kernel/Makefile
> ===================================================================
> --- slab.orig/kernel/Makefile
> +++ slab/kernel/Makefile
> @@ -140,6 +140,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
>  obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
>  obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
>  
> +obj-$(CONFIG_QPW) += qpw.o
> +
>  CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
>  CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
>  obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
> Index: slab/kernel/qpw.c
> ===================================================================
> --- /dev/null
> +++ slab/kernel/qpw.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/export.h"
> +#include <linux/sched.h>
> +#include <linux/qpw.h>
> +#include <linux/string.h>
> +
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
> +EXPORT_SYMBOL(qpw_sl);
> +
> +static int __init qpw_setup(char *str)
> +{
> +	int opt;
> +
> +	if (!get_option(&str, &opt)) {
> +		pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str);
> +		return 0;
> +	}
> +
> +	if (opt)
> +		static_branch_enable(&qpw_sl);
> +	else
> +		static_branch_disable(&qpw_sl);
> +
> +	return 0;
> +}
> +__setup("qpw=", qpw_setup);
> Index: slab/Documentation/locking/qpwlocks.rst
> ===================================================================
> --- /dev/null
> +++ slab/Documentation/locking/qpwlocks.rst
> @@ -0,0 +1,63 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========
> +QPW locks
> +=========
> +
> +Some places in the kernel implement a parallel programming strategy
> +consisting on local_locks() for most of the work, and some rare remote
> +operations are scheduled on target cpu. This keeps cache bouncing low since
> +cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> +kernels, even though the very few remote operations will be expensive due
> +to scheduling overhead.
> +
> +On the other hand, for RT workloads this can represent a problem:
> +scheduling work on remote cpu that are executing low latency tasks
> +is undesired and can introduce unexpected deadline misses.
> +
> +QPW locks help to convert sites that use local_locks (for cpu local operations)
> +and queue_work_on (for queueing work remotely, to be executed
> +locally on the owner cpu of the lock) to QPW locks.
> +
> +The lock is declared qpw_lock_t type.
> +The lock is initialized with qpw_lock_init.
> +The lock is locked with qpw_lock (takes a lock and cpu as a parameter).
> +The lock is unlocked with qpw_unlock (takes a lock and cpu as a parameter).
> +
> +The qpw_lock_irqsave function disables interrupts and saves current interrupt state,
> +cpu as a parameter.
> +
> +For trylock variant, there is the qpw_trylock_t type, initialized with
> +qpw_trylock_init. Then the corresponding qpw_trylock and
> +qpw_trylock_irqsave.
> +
> +work_struct should be replaced by qpw_struct, which contains a cpu parameter
> +(owner cpu of the lock), initialized by INIT_QPW.
> +
> +The queue work related functions (analogous to queue_work_on and flush_work) are:
> +queue_percpu_work_on and flush_percpu_work.
> +
> +The behaviour of the QPW functions is as follows:
> +
> +* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and qpw=off kernel

I don't think PREEMPT_RT is needed here (maybe it was copied from the 
previous QPW version which was dependent on PREEMPT_RT?)

> +boot parameter):
> +        - qpw_lock:                     local_lock
> +        - qpw_lock_irqsave:             local_lock_irqsave
> +        - qpw_trylock:                  local_trylock
> +        - qpw_trylock_irqsave:          local_trylock_irqsave
> +        - qpw_unlock:                   local_unlock
> +        - queue_percpu_work_on:         queue_work_on
> +        - flush_percpu_work:            flush_work
> +
> +* CONFIG_PREEMPT_RT or CONFIG_QPW (and CONFIG_QPW_DEFAULT or qpw=on kernel

Same here

> +boot parameter),
> +        - qpw_lock:                     spin_lock
> +        - qpw_lock_irqsave:             spin_lock_irqsave
> +        - qpw_trylock:                  spin_trylock
> +        - qpw_trylock_irqsave:          spin_trylock_irqsave
> +        - qpw_unlock:                   spin_unlock
> +        - queue_percpu_work_on:         executes work function on caller cpu
> +        - flush_percpu_work:            empty
> +
> +qpw_get_cpu(work_struct), to be called from within qpw work function,
> +returns the target cpu.
> 
> 


Other than that, LGTM!

Reviewed-by: Leonardo Bras <leobras.c@gmail.com>

Thanks!
Leo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2026-02-07  0:16   ` Leonardo Bras
@ 2026-02-11 12:09     ` Marcelo Tosatti
  2026-02-14 21:32       ` Leonardo Bras
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-11 12:09 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng

Hi Leonardo,

On Fri, Feb 06, 2026 at 09:16:36PM -0300, Leonardo Bras wrote:
> > ===================================================================
> > --- slab.orig/MAINTAINERS
> > +++ slab/MAINTAINERS
> > @@ -21291,6 +21291,12 @@ F:	Documentation/networking/device_drive
> >  F:	drivers/bus/fsl-mc/
> >  F:	include/uapi/linux/fsl_mc.h
> >  
> > +QPW
> > +M:	Leonardo Bras <leobras@redhat.com>
> 
> Thanks for keeping that up :)
> Could you please change this line to 
> 
> +M:	Leonardo Bras <leobras.c@gmail.com>
> 
> As I don't have access to Red Hat's mail anymore.
> The signoffs on each commit should be fine to keep :)

Done.

> 
> > +S:	Supported
> > +F:	include/linux/qpw.h
> > +F:	kernel/qpw.c
> > +
> 
> Should we also add the Documentation file as well?
> 
> +F:	Documentation/locking/qpwlocks.rst

Done.

> > +The queue work related functions (analogous to queue_work_on and flush_work) are:
> > +queue_percpu_work_on and flush_percpu_work.
> > +
> > +The behaviour of the QPW functions is as follows:
> > +
> > +* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and qpw=off kernel
> 
> I don't think PREEMPT_RT is needed here (maybe it was copied from the 
> previous QPW version which was dependent on PREEMPT_RT?)

Ah, OK, my bad. Well, shouldnt CONFIG_PREEMPT_RT select CONFIG_QPW and
CONFIG_QPW_DEFAULT=y ?

> > +boot parameter):
> > +        - qpw_lock:                     local_lock
> > +        - qpw_lock_irqsave:             local_lock_irqsave
> > +        - qpw_trylock:                  local_trylock
> > +        - qpw_trylock_irqsave:          local_trylock_irqsave
> > +        - qpw_unlock:                   local_unlock
> > +        - queue_percpu_work_on:         queue_work_on
> > +        - flush_percpu_work:            flush_work
> > +
> > +* CONFIG_PREEMPT_RT or CONFIG_QPW (and CONFIG_QPW_DEFAULT or qpw=on kernel
> 
> Same here
> 
> > +boot parameter),
> > +        - qpw_lock:                     spin_lock
> > +        - qpw_lock_irqsave:             spin_lock_irqsave
> > +        - qpw_trylock:                  spin_trylock
> > +        - qpw_trylock_irqsave:          spin_trylock_irqsave
> > +        - qpw_unlock:                   spin_unlock
> > +        - queue_percpu_work_on:         executes work function on caller cpu
> > +        - flush_percpu_work:            empty
> > +
> > +qpw_get_cpu(work_struct), to be called from within qpw work function,
> > +returns the target cpu.
> > 
> > 
> 
> 
> Other than that, LGTM!
> 
> Reviewed-by: Leonardo Bras <leobras.c@gmail.com>
> 
> Thanks!
> Leo
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
  2026-02-11 12:09     ` Marcelo Tosatti
@ 2026-02-14 21:32       ` Leonardo Bras
  0 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-14 21:32 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Wed, Feb 11, 2026 at 09:09:02AM -0300, Marcelo Tosatti wrote:
> Hi Leonardo,
> 
> On Fri, Feb 06, 2026 at 09:16:36PM -0300, Leonardo Bras wrote:
> > > ===================================================================
> > > --- slab.orig/MAINTAINERS
> > > +++ slab/MAINTAINERS
> > > @@ -21291,6 +21291,12 @@ F:	Documentation/networking/device_drive
> > >  F:	drivers/bus/fsl-mc/
> > >  F:	include/uapi/linux/fsl_mc.h
> > >  
> > > +QPW
> > > +M:	Leonardo Bras <leobras@redhat.com>
> > 
> > Thanks for keeping that up :)
> > Could you please change this line to 
> > 
> > +M:	Leonardo Bras <leobras.c@gmail.com>
> > 
> > As I don't have access to Red Hat's mail anymore.
> > The signoffs on each commit should be fine to keep :)
> 
> Done.
> 
> > 
> > > +S:	Supported
> > > +F:	include/linux/qpw.h
> > > +F:	kernel/qpw.c
> > > +
> > 
> > Should we also add the Documentation file as well?
> > 
> > +F:	Documentation/locking/qpwlocks.rst
> 
> Done.
> 
> > > +The queue work related functions (analogous to queue_work_on and flush_work) are:
> > > +queue_percpu_work_on and flush_percpu_work.
> > > +
> > > +The behaviour of the QPW functions is as follows:
> > > +
> > > +* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and qpw=off kernel
> > 
> > I don't think PREEMPT_RT is needed here (maybe it was copied from the 
> > previous QPW version which was dependent on PREEMPT_RT?)
> 
> Ah, OK, my bad. Well, shouldnt CONFIG_PREEMPT_RT select CONFIG_QPW and
> CONFIG_QPW_DEFAULT=y ?

Oh, I sure think it should, even if not doing so at the current patchset.

But my point in above comment is that even if it did, there was no need to 
mention !RT and !QPW, as RT would select QPW, so you only need to mention 
QPW :)

Before QPW having it's own CONFIG_ I was using RT to compile this in, so 
maybe that's why the previous version of the cover letter mentioned it. :\

Thanks!
Leo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 2/4] mm/swap: move bh draining into a separate workqueue
  2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
  2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
@ 2026-02-06 14:34 ` Marcelo Tosatti
  2026-02-06 14:34 ` [PATCH 3/4] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-06 14:34 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Marcelo Tosatti

Separate the bh draining into a separate workqueue
(from the mm lru draining), so that its possible to switch
the mm lru draining to QPW.

To switch bh draining to QPW, it would be necessary to add
a spinlock to addition of bhs to percpu cache, and that is a
very hot path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/swap.c |   52 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 15 deletions(-)

Index: slab/mm/swap.c
===================================================================
--- slab.orig/mm/swap.c
+++ slab/mm/swap.c
@@ -745,12 +745,11 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_mm_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	local_unlock(&cpu_fbatches.lock);
-	invalidate_bh_lrus_cpu();
 	mlock_drain_local();
 }
 
@@ -769,10 +768,17 @@ static DEFINE_PER_CPU(struct work_struct
 
 static void lru_add_drain_per_cpu(struct work_struct *dummy)
 {
-	lru_add_and_bh_lrus_drain();
+	lru_add_mm_drain();
 }
 
-static bool cpu_needs_drain(unsigned int cpu)
+static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
+
+static void bh_add_drain_per_cpu(struct work_struct *dummy)
+{
+	invalidate_bh_lrus_cpu();
+}
+
+static bool cpu_needs_mm_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 
@@ -783,8 +789,12 @@ static bool cpu_needs_drain(unsigned int
 		folio_batch_count(&fbatches->lru_deactivate) ||
 		folio_batch_count(&fbatches->lru_lazyfree) ||
 		folio_batch_count(&fbatches->lru_activate) ||
-		need_mlock_drain(cpu) ||
-		has_bh_in_lru(cpu, NULL);
+		need_mlock_drain(cpu);
+}
+
+static bool cpu_needs_bh_drain(unsigned int cpu)
+{
+	return has_bh_in_lru(cpu, NULL);
 }
 
 /*
@@ -807,7 +817,7 @@ static inline void __lru_add_drain_all(b
 	 * each CPU.
 	 */
 	static unsigned int lru_drain_gen;
-	static struct cpumask has_work;
+	static struct cpumask has_mm_work, has_bh_work;
 	static DEFINE_MUTEX(lock);
 	unsigned cpu, this_gen;
 
@@ -870,20 +880,31 @@ static inline void __lru_add_drain_all(b
 	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
 	smp_mb();
 
-	cpumask_clear(&has_work);
+	cpumask_clear(&has_mm_work);
+	cpumask_clear(&has_bh_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
+
+		if (cpu_needs_mm_drain(cpu)) {
+			INIT_WORK(mm_work, lru_add_drain_per_cpu);
+			queue_work_on(cpu, mm_percpu_wq, mm_work);
+			__cpumask_set_cpu(cpu, &has_mm_work);
+		}
 
-		if (cpu_needs_drain(cpu)) {
-			INIT_WORK(work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, work);
-			__cpumask_set_cpu(cpu, &has_work);
+		if (cpu_needs_bh_drain(cpu)) {
+			INIT_WORK(bh_work, bh_add_drain_per_cpu);
+			queue_work_on(cpu, mm_percpu_wq, bh_work);
+			__cpumask_set_cpu(cpu, &has_bh_work);
 		}
 	}
 
-	for_each_cpu(cpu, &has_work)
+	for_each_cpu(cpu, &has_mm_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
+	for_each_cpu(cpu, &has_bh_work)
+		flush_work(&per_cpu(bh_add_drain_work, cpu));
+
 done:
 	mutex_unlock(&lock);
 }
@@ -929,7 +950,8 @@ void lru_cache_disable(void)
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_and_bh_lrus_drain();
+	lru_add_mm_drain();
+	invalidate_bh_lrus_cpu();
 #endif
 }
 




^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 3/4] swap: apply new queue_percpu_work_on() interface
  2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
  2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
  2026-02-06 14:34 ` [PATCH 2/4] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
@ 2026-02-06 14:34 ` Marcelo Tosatti
  2026-02-07  1:06   ` Leonardo Bras
  2026-02-06 14:34 ` [PATCH 4/4] slub: " Marcelo Tosatti
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-06 14:34 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Marcelo Tosatti

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled in a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
queue_percpu_work_on(). The same happens for flush_work() and
flush_percpu_work().

The change requires allocation of qpw_structs instead of a work_structs,
and changing parameters of a few functions to include the cpu parameter.

This should bring no relevant performance impact on non-RT kernels:
For functions that may be scheduled in a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/internal.h   |    4 +-
 mm/mlock.c      |   71 ++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c |    2 -
 mm/swap.c       |   90 +++++++++++++++++++++++++++++++-------------------------
 4 files changed, 108 insertions(+), 59 deletions(-)

Index: slab/mm/mlock.c
===================================================================
--- slab.orig/mm/mlock.c
+++ slab/mm/mlock.c
@@ -25,17 +25,16 @@
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
 #include <linux/secretmem.h>
+#include <linux/qpw.h>
 
 #include "internal.h"
 
 struct mlock_fbatch {
-	local_lock_t lock;
+	qpw_lock_t lock;
 	struct folio_batch fbatch;
 };
 
-static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-};
+static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch);
 
 bool can_do_mlock(void)
 {
@@ -209,18 +208,25 @@ static void mlock_folio_batch(struct fol
 	folios_put(fbatch);
 }
 
-void mlock_drain_local(void)
+void mlock_drain_cpu(int cpu)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
-	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
+	qpw_lock(&mlock_fbatch.lock, cpu);
+	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
 	if (folio_batch_count(fbatch))
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	qpw_unlock(&mlock_fbatch.lock, cpu);
 }
 
-void mlock_drain_remote(int cpu)
+void mlock_drain_local(void)
+{
+	migrate_disable();
+	mlock_drain_cpu(smp_processor_id());
+	migrate_enable();
+}
+
+void mlock_drain_offline(int cpu)
 {
 	struct folio_batch *fbatch;
 
@@ -242,9 +248,12 @@ bool need_mlock_drain(int cpu)
 void mlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
+	int cpu;
 
-	local_lock(&mlock_fbatch.lock);
-	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&mlock_fbatch.lock, cpu);
+	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
 
 	if (!folio_test_set_mlocked(folio)) {
 		int nr_pages = folio_nr_pages(folio);
@@ -257,7 +266,8 @@ void mlock_folio(struct folio *folio)
 	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	qpw_unlock(&mlock_fbatch.lock, cpu);
+	migrate_enable();
 }
 
 /**
@@ -268,9 +278,13 @@ void mlock_new_folio(struct folio *folio
 {
 	struct folio_batch *fbatch;
 	int nr_pages = folio_nr_pages(folio);
+	int cpu;
+
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&mlock_fbatch.lock, cpu);
 
-	local_lock(&mlock_fbatch.lock);
-	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
+	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
 	folio_set_mlocked(folio);
 
 	zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
@@ -280,7 +294,8 @@ void mlock_new_folio(struct folio *folio
 	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	migrate_enable();
+	qpw_unlock(&mlock_fbatch.lock, cpu);
 }
 
 /**
@@ -290,9 +305,13 @@ void mlock_new_folio(struct folio *folio
 void munlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
+	int cpu;
 
-	local_lock(&mlock_fbatch.lock);
-	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&mlock_fbatch.lock, cpu);
+
+	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
 	/*
 	 * folio_test_clear_mlocked(folio) must be left to __munlock_folio(),
 	 * which will check whether the folio is multiply mlocked.
@@ -301,7 +320,8 @@ void munlock_folio(struct folio *folio)
 	if (!folio_batch_add(fbatch, folio) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	qpw_unlock(&mlock_fbatch.lock, cpu);
+	migrate_enable();
 }
 
 static inline unsigned int folio_mlock_step(struct folio *folio,
@@ -823,3 +843,18 @@ void user_shm_unlock(size_t size, struct
 	spin_unlock(&shmlock_user_lock);
 	put_ucounts(ucounts);
 }
+
+int __init mlock_init(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct mlock_fbatch *fbatch = &per_cpu(mlock_fbatch, cpu);
+
+		qpw_lock_init(&fbatch->lock);
+	}
+
+	return 0;
+}
+
+module_init(mlock_init);
Index: slab/mm/swap.c
===================================================================
--- slab.orig/mm/swap.c
+++ slab/mm/swap.c
@@ -35,7 +35,7 @@
 #include <linux/uio.h>
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
-#include <linux/local_lock.h>
+#include <linux/qpw.h>
 #include <linux/buffer_head.h>
 
 #include "internal.h"
@@ -52,7 +52,7 @@ struct cpu_fbatches {
 	 * The following folio batches are grouped together because they are protected
 	 * by disabling preemption (and interrupts remain enabled).
 	 */
-	local_lock_t lock;
+	qpw_lock_t lock;
 	struct folio_batch lru_add;
 	struct folio_batch lru_deactivate_file;
 	struct folio_batch lru_deactivate;
@@ -61,14 +61,11 @@ struct cpu_fbatches {
 	struct folio_batch lru_activate;
 #endif
 	/* Protecting the following batches which require disabling interrupts */
-	local_lock_t lock_irq;
+	qpw_lock_t lock_irq;
 	struct folio_batch lru_move_tail;
 };
 
-static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-	.lock_irq = INIT_LOCAL_LOCK(lock_irq),
-};
+static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches);
 
 static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
 		unsigned long *flagsp)
@@ -183,22 +180,24 @@ static void __folio_batch_add_and_move(s
 		struct folio *folio, move_fn_t move_fn, bool disable_irq)
 {
 	unsigned long flags;
+	int cpu;
 
 	folio_get(folio);
 
+	cpu = smp_processor_id();
 	if (disable_irq)
-		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+		qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
 	else
-		local_lock(&cpu_fbatches.lock);
+		qpw_lock(&cpu_fbatches.lock, cpu);
 
-	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+	if (!folio_batch_add(per_cpu_ptr(fbatch, cpu), folio) ||
 			!folio_may_be_lru_cached(folio) || lru_cache_disabled())
-		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
+		folio_batch_move_lru(per_cpu_ptr(fbatch, cpu), move_fn);
 
 	if (disable_irq)
-		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+		qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
 	else
-		local_unlock(&cpu_fbatches.lock);
+		qpw_unlock(&cpu_fbatches.lock, cpu);
 }
 
 #define folio_batch_add_and_move(folio, op)		\
@@ -358,9 +357,10 @@ static void __lru_cache_activate_folio(s
 {
 	struct folio_batch *fbatch;
 	int i;
+	int cpu = smp_processor_id();
 
-	local_lock(&cpu_fbatches.lock);
-	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
+	qpw_lock(&cpu_fbatches.lock, cpu);
+	fbatch = per_cpu_ptr(&cpu_fbatches.lru_add, cpu);
 
 	/*
 	 * Search backwards on the optimistic assumption that the folio being
@@ -381,7 +381,7 @@ static void __lru_cache_activate_folio(s
 		}
 	}
 
-	local_unlock(&cpu_fbatches.lock);
+	qpw_unlock(&cpu_fbatches.lock, cpu);
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -653,9 +653,9 @@ void lru_add_drain_cpu(int cpu)
 		unsigned long flags;
 
 		/* No harm done if a racing interrupt already did this */
-		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+		qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
 		folio_batch_move_lru(fbatch, lru_move_tail);
-		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+		qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
 	}
 
 	fbatch = &fbatches->lru_deactivate_file;
@@ -733,10 +733,12 @@ void folio_mark_lazyfree(struct folio *f
 
 void lru_add_drain(void)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
-	mlock_drain_local();
+	int cpu = smp_processor_id();
+
+	qpw_lock(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	qpw_unlock(&cpu_fbatches.lock, cpu);
+	mlock_drain_cpu(cpu);
 }
 
 /*
@@ -745,30 +747,32 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_mm_drain(void)
+static void lru_add_mm_drain(int cpu)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
-	mlock_drain_local();
+	qpw_lock(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	qpw_unlock(&cpu_fbatches.lock, cpu);
+	mlock_drain_cpu(cpu);
 }
 
 void lru_add_drain_cpu_zone(struct zone *zone)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
+	int cpu = smp_processor_id();
+
+	qpw_lock(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
 	drain_local_pages(zone);
-	local_unlock(&cpu_fbatches.lock);
-	mlock_drain_local();
+	qpw_unlock(&cpu_fbatches.lock, cpu);
+	mlock_drain_cpu(cpu);
 }
 
 #ifdef CONFIG_SMP
 
-static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
+static DEFINE_PER_CPU(struct qpw_struct, lru_add_drain_qpw);
 
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
 {
-	lru_add_mm_drain();
+	lru_add_mm_drain(qpw_get_cpu(w));
 }
 
 static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
@@ -883,12 +887,12 @@ static inline void __lru_add_drain_all(b
 	cpumask_clear(&has_mm_work);
 	cpumask_clear(&has_bh_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+		struct qpw_struct *mm_qpw = &per_cpu(lru_add_drain_qpw, cpu);
 		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
 
 		if (cpu_needs_mm_drain(cpu)) {
-			INIT_WORK(mm_work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, mm_work);
+			INIT_QPW(mm_qpw, lru_add_drain_per_cpu, cpu);
+			queue_percpu_work_on(cpu, mm_percpu_wq, mm_qpw);
 			__cpumask_set_cpu(cpu, &has_mm_work);
 		}
 
@@ -900,7 +904,7 @@ static inline void __lru_add_drain_all(b
 	}
 
 	for_each_cpu(cpu, &has_mm_work)
-		flush_work(&per_cpu(lru_add_drain_work, cpu));
+		flush_percpu_work(&per_cpu(lru_add_drain_qpw, cpu));
 
 	for_each_cpu(cpu, &has_bh_work)
 		flush_work(&per_cpu(bh_add_drain_work, cpu));
@@ -950,7 +954,7 @@ void lru_cache_disable(void)
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_mm_drain();
+	lru_add_mm_drain(smp_processor_id());
 	invalidate_bh_lrus_cpu();
 #endif
 }
@@ -1124,6 +1128,7 @@ static const struct ctl_table swap_sysct
 void __init swap_setup(void)
 {
 	unsigned long megs = PAGES_TO_MB(totalram_pages());
+	unsigned int cpu;
 
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
@@ -1136,4 +1141,11 @@ void __init swap_setup(void)
 	 */
 
 	register_sysctl_init("vm", swap_sysctl_table);
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
+
+		qpw_lock_init(&fbatches->lock);
+		qpw_lock_init(&fbatches->lock_irq);
+	}
 }
Index: slab/mm/internal.h
===================================================================
--- slab.orig/mm/internal.h
+++ slab/mm/internal.h
@@ -1061,10 +1061,12 @@ static inline void munlock_vma_folio(str
 		munlock_folio(folio);
 }
 
+int __init mlock_init(void);
 void mlock_new_folio(struct folio *folio);
 bool need_mlock_drain(int cpu);
 void mlock_drain_local(void);
-void mlock_drain_remote(int cpu);
+void mlock_drain_cpu(int cpu);
+void mlock_drain_offline(int cpu);
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
Index: slab/mm/page_alloc.c
===================================================================
--- slab.orig/mm/page_alloc.c
+++ slab/mm/page_alloc.c
@@ -6251,7 +6251,7 @@ static int page_alloc_cpu_dead(unsigned
 	struct zone *zone;
 
 	lru_add_drain_cpu(cpu);
-	mlock_drain_remote(cpu);
+	mlock_drain_offline(cpu);
 	drain_pages(cpu);
 
 	/*




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/4] swap: apply new queue_percpu_work_on() interface
  2026-02-06 14:34 ` [PATCH 3/4] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
@ 2026-02-07  1:06   ` Leonardo Bras
  0 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-07  1:06 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Fri, Feb 06, 2026 at 11:34:33AM -0300, Marcelo Tosatti wrote:
> Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
> interface to improve performance & latency on PREEMPT_RT kernels.
> 
> For functions that may be scheduled in a different cpu, replace
> local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
> queue_percpu_work_on(). The same happens for flush_work() and
> flush_percpu_work().
> 
> The change requires allocation of qpw_structs instead of a work_structs,
> and changing parameters of a few functions to include the cpu parameter.
> 
> This should bring no relevant performance impact on non-RT kernels:

I think this is still referencing the previuos version, as there may be 
impact in PREEMPT_RT=n kernels if QPW=y and qpw=1 in kernel cmdline.

I would go with:
This should bring no relevant performance impact on non-QPW kernels

> For functions that may be scheduled in a different cpu, the local_*lock's
> this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  mm/internal.h   |    4 +-
>  mm/mlock.c      |   71 ++++++++++++++++++++++++++++++++------------
>  mm/page_alloc.c |    2 -
>  mm/swap.c       |   90 +++++++++++++++++++++++++++++++-------------------------
>  4 files changed, 108 insertions(+), 59 deletions(-)
> 
> Index: slab/mm/mlock.c
> ===================================================================
> --- slab.orig/mm/mlock.c
> +++ slab/mm/mlock.c
> @@ -25,17 +25,16 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mm_inline.h>
>  #include <linux/secretmem.h>
> +#include <linux/qpw.h>
>  
>  #include "internal.h"
>  
>  struct mlock_fbatch {
> -	local_lock_t lock;
> +	qpw_lock_t lock;
>  	struct folio_batch fbatch;
>  };
>  
> -static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch) = {
> -	.lock = INIT_LOCAL_LOCK(lock),
> -};
> +static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch);
>  
>  bool can_do_mlock(void)
>  {
> @@ -209,18 +208,25 @@ static void mlock_folio_batch(struct fol
>  	folios_put(fbatch);
>  }
>  
> -void mlock_drain_local(void)
> +void mlock_drain_cpu(int cpu)
>  {
>  	struct folio_batch *fbatch;
>  
> -	local_lock(&mlock_fbatch.lock);
> -	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
> +	qpw_lock(&mlock_fbatch.lock, cpu);
> +	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
>  	if (folio_batch_count(fbatch))
>  		mlock_folio_batch(fbatch);
> -	local_unlock(&mlock_fbatch.lock);
> +	qpw_unlock(&mlock_fbatch.lock, cpu);
>  }
>  
> -void mlock_drain_remote(int cpu)
> +void mlock_drain_local(void)
> +{
> +	migrate_disable();
> +	mlock_drain_cpu(smp_processor_id());
> +	migrate_enable();
> +}
> +
> +void mlock_drain_offline(int cpu)
>  {
>  	struct folio_batch *fbatch;
>  
> @@ -242,9 +248,12 @@ bool need_mlock_drain(int cpu)
>  void mlock_folio(struct folio *folio)
>  {
>  	struct folio_batch *fbatch;
> +	int cpu;
>  
> -	local_lock(&mlock_fbatch.lock);
> -	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
> +	migrate_disable();
> +	cpu = smp_processor_id();

Wondering if for these cases it would make sense to have something like:

qpw_get_local_cpu() and 
qpw_put_local_cpu() 

so we could encapsulate these migrate_{en,dis}able()
and the smp_processor_id().

Or even,

int qpw_local_lock() {
	migrate_disable();
	cpu = smp_processor_id();
	qpw_lock(..., cpu);

	return cpu;
}

and

qpw_local_unlock(cpu){
	qpw_unlock(...,cpu);
	migrate_enable();
} 

so it's more direct to convert the local-only cases.

What do you think?


> +	qpw_lock(&mlock_fbatch.lock, cpu);
> +	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
>  
>  	if (!folio_test_set_mlocked(folio)) {
>  		int nr_pages = folio_nr_pages(folio);
> @@ -257,7 +266,8 @@ void mlock_folio(struct folio *folio)
>  	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
>  	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
>  		mlock_folio_batch(fbatch);
> -	local_unlock(&mlock_fbatch.lock);
> +	qpw_unlock(&mlock_fbatch.lock, cpu);
> +	migrate_enable();
>  }
>  
>  /**
> @@ -268,9 +278,13 @@ void mlock_new_folio(struct folio *folio
>  {
>  	struct folio_batch *fbatch;
>  	int nr_pages = folio_nr_pages(folio);
> +	int cpu;
> +
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	qpw_lock(&mlock_fbatch.lock, cpu);
>  
> -	local_lock(&mlock_fbatch.lock);
> -	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
> +	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
>  	folio_set_mlocked(folio);
>  
>  	zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
> @@ -280,7 +294,8 @@ void mlock_new_folio(struct folio *folio
>  	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
>  	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
>  		mlock_folio_batch(fbatch);
> -	local_unlock(&mlock_fbatch.lock);
> +	migrate_enable();
> +	qpw_unlock(&mlock_fbatch.lock, cpu);

in the above conversion, the migrate_enable() happened after qpw_unlock,
and in this one is the oposite. Any particular reason?

>  }
>  
>  /**
> @@ -290,9 +305,13 @@ void mlock_new_folio(struct folio *folio
>  void munlock_folio(struct folio *folio)
>  {
>  	struct folio_batch *fbatch;
> +	int cpu;
>  
> -	local_lock(&mlock_fbatch.lock);
> -	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	qpw_lock(&mlock_fbatch.lock, cpu);
> +
> +	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
>  	/*
>  	 * folio_test_clear_mlocked(folio) must be left to __munlock_folio(),
>  	 * which will check whether the folio is multiply mlocked.
> @@ -301,7 +320,8 @@ void munlock_folio(struct folio *folio)
>  	if (!folio_batch_add(fbatch, folio) ||
>  	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
>  		mlock_folio_batch(fbatch);
> -	local_unlock(&mlock_fbatch.lock);
> +	qpw_unlock(&mlock_fbatch.lock, cpu);
> +	migrate_enable();
>  }
>  
>  static inline unsigned int folio_mlock_step(struct folio *folio,
> @@ -823,3 +843,18 @@ void user_shm_unlock(size_t size, struct
>  	spin_unlock(&shmlock_user_lock);
>  	put_ucounts(ucounts);
>  }
> +
> +int __init mlock_init(void)
> +{
> +	unsigned int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct mlock_fbatch *fbatch = &per_cpu(mlock_fbatch, cpu);
> +
> +		qpw_lock_init(&fbatch->lock);
> +	}
> +
> +	return 0;
> +}
> +
> +module_init(mlock_init);
> Index: slab/mm/swap.c
> ===================================================================
> --- slab.orig/mm/swap.c
> +++ slab/mm/swap.c
> @@ -35,7 +35,7 @@
>  #include <linux/uio.h>
>  #include <linux/hugetlb.h>
>  #include <linux/page_idle.h>
> -#include <linux/local_lock.h>
> +#include <linux/qpw.h>
>  #include <linux/buffer_head.h>
>  
>  #include "internal.h"
> @@ -52,7 +52,7 @@ struct cpu_fbatches {
>  	 * The following folio batches are grouped together because they are protected
>  	 * by disabling preemption (and interrupts remain enabled).
>  	 */
> -	local_lock_t lock;
> +	qpw_lock_t lock;
>  	struct folio_batch lru_add;
>  	struct folio_batch lru_deactivate_file;
>  	struct folio_batch lru_deactivate;
> @@ -61,14 +61,11 @@ struct cpu_fbatches {
>  	struct folio_batch lru_activate;
>  #endif
>  	/* Protecting the following batches which require disabling interrupts */
> -	local_lock_t lock_irq;
> +	qpw_lock_t lock_irq;
>  	struct folio_batch lru_move_tail;
>  };
>  
> -static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
> -	.lock = INIT_LOCAL_LOCK(lock),
> -	.lock_irq = INIT_LOCAL_LOCK(lock_irq),
> -};
> +static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches);
>  
>  static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
>  		unsigned long *flagsp)
> @@ -183,22 +180,24 @@ static void __folio_batch_add_and_move(s
>  		struct folio *folio, move_fn_t move_fn, bool disable_irq)
>  {
>  	unsigned long flags;
> +	int cpu;
>  
>  	folio_get(folio);


don't we need the migrate_disable() here?

>  
> +	cpu = smp_processor_id();
>  	if (disable_irq)
> -		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
> +		qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
>  	else
> -		local_lock(&cpu_fbatches.lock);
> +		qpw_lock(&cpu_fbatches.lock, cpu);
>  
> -	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
> +	if (!folio_batch_add(per_cpu_ptr(fbatch, cpu), folio) ||
>  			!folio_may_be_lru_cached(folio) || lru_cache_disabled())
> -		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
> +		folio_batch_move_lru(per_cpu_ptr(fbatch, cpu), move_fn);
>  
>  	if (disable_irq)
> -		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
> +		qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
>  	else
> -		local_unlock(&cpu_fbatches.lock);
> +		qpw_unlock(&cpu_fbatches.lock, cpu);
>  }
>  
>  #define folio_batch_add_and_move(folio, op)		\
> @@ -358,9 +357,10 @@ static void __lru_cache_activate_folio(s
>  {
>  	struct folio_batch *fbatch;
>  	int i;

and here?

> +	int cpu = smp_processor_id();
>  
> -	local_lock(&cpu_fbatches.lock);
> -	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
> +	qpw_lock(&cpu_fbatches.lock, cpu);
> +	fbatch = per_cpu_ptr(&cpu_fbatches.lru_add, cpu);
>  
>  	/*
>  	 * Search backwards on the optimistic assumption that the folio being
> @@ -381,7 +381,7 @@ static void __lru_cache_activate_folio(s
>  		}
>  	}
>  
> -	local_unlock(&cpu_fbatches.lock);
> +	qpw_unlock(&cpu_fbatches.lock, cpu);
>  }
>  
>  #ifdef CONFIG_LRU_GEN
> @@ -653,9 +653,9 @@ void lru_add_drain_cpu(int cpu)
>  		unsigned long flags;
>  
>  		/* No harm done if a racing interrupt already did this */
> -		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
> +		qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
>  		folio_batch_move_lru(fbatch, lru_move_tail);
> -		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
> +		qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
>  	}
>  
>  	fbatch = &fbatches->lru_deactivate_file;
> @@ -733,10 +733,12 @@ void folio_mark_lazyfree(struct folio *f
>  
>  void lru_add_drain(void)
>  {
> -	local_lock(&cpu_fbatches.lock);
> -	lru_add_drain_cpu(smp_processor_id());
> -	local_unlock(&cpu_fbatches.lock);
> -	mlock_drain_local();

and here?

> +	int cpu = smp_processor_id();
> +
> +	qpw_lock(&cpu_fbatches.lock, cpu);
> +	lru_add_drain_cpu(cpu);
> +	qpw_unlock(&cpu_fbatches.lock, cpu);
> +	mlock_drain_cpu(cpu);
>  }
>  
>  /*
> @@ -745,30 +747,32 @@ void lru_add_drain(void)
>   * the same cpu. It shouldn't be a problem in !SMP case since
>   * the core is only one and the locks will disable preemption.
>   */
> -static void lru_add_mm_drain(void)
> +static void lru_add_mm_drain(int cpu)
>  {
> -	local_lock(&cpu_fbatches.lock);
> -	lru_add_drain_cpu(smp_processor_id());
> -	local_unlock(&cpu_fbatches.lock);
> -	mlock_drain_local();
> +	qpw_lock(&cpu_fbatches.lock, cpu);
> +	lru_add_drain_cpu(cpu);
> +	qpw_unlock(&cpu_fbatches.lock, cpu);
> +	mlock_drain_cpu(cpu);
>  }
>  
>  void lru_add_drain_cpu_zone(struct zone *zone)
>  {
> -	local_lock(&cpu_fbatches.lock);
> -	lru_add_drain_cpu(smp_processor_id());

and here ?

> +	int cpu = smp_processor_id();
> +
> +	qpw_lock(&cpu_fbatches.lock, cpu);
> +	lru_add_drain_cpu(cpu);
>  	drain_local_pages(zone);
> -	local_unlock(&cpu_fbatches.lock);
> -	mlock_drain_local();
> +	qpw_unlock(&cpu_fbatches.lock, cpu);
> +	mlock_drain_cpu(cpu);
>  }
>  
>  #ifdef CONFIG_SMP
>  
> -static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
> +static DEFINE_PER_CPU(struct qpw_struct, lru_add_drain_qpw);
>  
> -static void lru_add_drain_per_cpu(struct work_struct *dummy)
> +static void lru_add_drain_per_cpu(struct work_struct *w)
>  {
> -	lru_add_mm_drain();
> +	lru_add_mm_drain(qpw_get_cpu(w));
>  }
>  
>  static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
> @@ -883,12 +887,12 @@ static inline void __lru_add_drain_all(b
>  	cpumask_clear(&has_mm_work);
>  	cpumask_clear(&has_bh_work);
>  	for_each_online_cpu(cpu) {
> -		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
> +		struct qpw_struct *mm_qpw = &per_cpu(lru_add_drain_qpw, cpu);
>  		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
>  
>  		if (cpu_needs_mm_drain(cpu)) {
> -			INIT_WORK(mm_work, lru_add_drain_per_cpu);
> -			queue_work_on(cpu, mm_percpu_wq, mm_work);
> +			INIT_QPW(mm_qpw, lru_add_drain_per_cpu, cpu);
> +			queue_percpu_work_on(cpu, mm_percpu_wq, mm_qpw);
>  			__cpumask_set_cpu(cpu, &has_mm_work);
>  		}
>  
> @@ -900,7 +904,7 @@ static inline void __lru_add_drain_all(b
>  	}
>  
>  	for_each_cpu(cpu, &has_mm_work)
> -		flush_work(&per_cpu(lru_add_drain_work, cpu));
> +		flush_percpu_work(&per_cpu(lru_add_drain_qpw, cpu));
>  
>  	for_each_cpu(cpu, &has_bh_work)
>  		flush_work(&per_cpu(bh_add_drain_work, cpu));
> @@ -950,7 +954,7 @@ void lru_cache_disable(void)
>  #ifdef CONFIG_SMP
>  	__lru_add_drain_all(true);
>  #else
> -	lru_add_mm_drain();

and here, I wonder

> +	lru_add_mm_drain(smp_processor_id());
>  	invalidate_bh_lrus_cpu();
>  #endif
>  }
> @@ -1124,6 +1128,7 @@ static const struct ctl_table swap_sysct
>  void __init swap_setup(void)
>  {
>  	unsigned long megs = PAGES_TO_MB(totalram_pages());
> +	unsigned int cpu;
>  
>  	/* Use a smaller cluster for small-memory machines */
>  	if (megs < 16)
> @@ -1136,4 +1141,11 @@ void __init swap_setup(void)
>  	 */
>  
>  	register_sysctl_init("vm", swap_sysctl_table);
> +
> +	for_each_possible_cpu(cpu) {
> +		struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
> +
> +		qpw_lock_init(&fbatches->lock);
> +		qpw_lock_init(&fbatches->lock_irq);
> +	}
>  }
> Index: slab/mm/internal.h
> ===================================================================
> --- slab.orig/mm/internal.h
> +++ slab/mm/internal.h
> @@ -1061,10 +1061,12 @@ static inline void munlock_vma_folio(str
>  		munlock_folio(folio);
>  }
>  
> +int __init mlock_init(void);
>  void mlock_new_folio(struct folio *folio);
>  bool need_mlock_drain(int cpu);
>  void mlock_drain_local(void);
> -void mlock_drain_remote(int cpu);
> +void mlock_drain_cpu(int cpu);
> +void mlock_drain_offline(int cpu);
>  
>  extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>  
> Index: slab/mm/page_alloc.c
> ===================================================================
> --- slab.orig/mm/page_alloc.c
> +++ slab/mm/page_alloc.c
> @@ -6251,7 +6251,7 @@ static int page_alloc_cpu_dead(unsigned
>  	struct zone *zone;
>  
>  	lru_add_drain_cpu(cpu);
> -	mlock_drain_remote(cpu);
> +	mlock_drain_offline(cpu);
>  	drain_pages(cpu);
>  
>  	/*
> 
> 

TBH, I am still trying to understand if we need the migrate_{en,dis}able():
- There is a data dependency beween cpu being filled and being used.
- If we get the cpu, and then migrate to a different cpu, the operation 
  will still be executed with the data from that starting cpu 
- But maybe the compiler tries to optize this because the processor number 
  can be on a register and of easy access, which would break this.

Maybe a READ_ONCE() on smp_processor_id() should suffice?

Other than that, all the conversions done look correct.

That being said, I understand very little about mm code, so let's hope we 
get proper feedback from those who do :) 

Thanks!
Leo



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 4/4] slub: apply new queue_percpu_work_on() interface
  2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2026-02-06 14:34 ` [PATCH 3/4] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
@ 2026-02-06 14:34 ` Marcelo Tosatti
  2026-02-07  1:27   ` Leonardo Bras
  2026-02-06 23:56 ` [PATCH 0/4] Introduce QPW for per-cpu operations Leonardo Bras
  2026-02-10 14:01 ` Michal Hocko
  5 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-06 14:34 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Marcelo Tosatti

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled in a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
queue_percpu_work_on(). The same happens for flush_work() and
flush_percpu_work().

This change requires allocation of qpw_structs instead of a work_structs,
and changing parameters of a few functions to include the cpu parameter.

This should bring no relevant performance impact on non-RT kernels:
For functions that may be scheduled in a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/slub.c |  218 ++++++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 142 insertions(+), 76 deletions(-)

Index: slab/mm/slub.c
===================================================================
--- slab.orig/mm/slub.c
+++ slab/mm/slub.c
@@ -49,6 +49,7 @@
 #include <linux/irq_work.h>
 #include <linux/kprobes.h>
 #include <linux/debugfs.h>
+#include <linux/qpw.h>
 #include <trace/events/kmem.h>
 
 #include "internal.h"
@@ -128,7 +129,7 @@
  *   For debug caches, all allocations are forced to go through a list_lock
  *   protected region to serialize against concurrent validation.
  *
- *   cpu_sheaves->lock (local_trylock)
+ *   cpu_sheaves->lock (qpw_trylock)
  *
  *   This lock protects fastpath operations on the percpu sheaves. On !RT it
  *   only disables preemption and does no atomic operations. As long as the main
@@ -156,7 +157,7 @@
  *   Interrupts are disabled as part of list_lock or barn lock operations, or
  *   around the slab_lock operation, in order to make the slab allocator safe
  *   to use in the context of an irq.
- *   Preemption is disabled as part of local_trylock operations.
+ *   Preemption is disabled as part of qpw_trylock operations.
  *   kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
  *   their limitations.
  *
@@ -417,7 +418,7 @@ struct slab_sheaf {
 };
 
 struct slub_percpu_sheaves {
-	local_trylock_t lock;
+	qpw_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
 	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
@@ -479,7 +480,7 @@ static nodemask_t slab_nodes;
 static struct workqueue_struct *flushwq;
 
 struct slub_flush_work {
-	struct work_struct work;
+	struct qpw_struct qpw;
 	struct kmem_cache *s;
 	bool skip;
 };
@@ -2826,7 +2827,7 @@ static void __kmem_cache_free_bulk(struc
  *
  * returns true if at least partially flushed
  */
-static bool sheaf_flush_main(struct kmem_cache *s)
+static bool sheaf_flush_main(struct kmem_cache *s, int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 	unsigned int batch, remaining;
@@ -2835,10 +2836,10 @@ static bool sheaf_flush_main(struct kmem
 	bool ret = false;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu))
 		return ret;
 
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 	sheaf = pcs->main;
 
 	batch = min(PCS_BATCH_MAX, sheaf->size);
@@ -2848,7 +2849,7 @@ next_batch:
 
 	remaining = sheaf->size;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	__kmem_cache_free_bulk(s, batch, &objects[0]);
 
@@ -2932,13 +2933,13 @@ static void rcu_free_sheaf_nobarn(struct
  * flushing operations are rare so let's keep it simple and flush to slabs
  * directly, skipping the barn
  */
-static void pcs_flush_all(struct kmem_cache *s)
+static void pcs_flush_all(struct kmem_cache *s, int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *spare, *rcu_free;
 
-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	spare = pcs->spare;
 	pcs->spare = NULL;
@@ -2946,7 +2947,7 @@ static void pcs_flush_all(struct kmem_ca
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	if (spare) {
 		sheaf_flush_unused(s, spare);
@@ -2956,7 +2957,7 @@ static void pcs_flush_all(struct kmem_ca
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 
-	sheaf_flush_main(s);
+	sheaf_flush_main(s, cpu);
 }
 
 static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
@@ -3881,13 +3882,13 @@ static void flush_cpu_sheaves(struct wor
 {
 	struct kmem_cache *s;
 	struct slub_flush_work *sfw;
+	int cpu = qpw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
-
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;
 
 	if (cache_has_sheaves(s))
-		pcs_flush_all(s);
+		pcs_flush_all(s, cpu);
 }
 
 static void flush_all_cpus_locked(struct kmem_cache *s)
@@ -3904,17 +3905,17 @@ static void flush_all_cpus_locked(struct
 			sfw->skip = true;
 			continue;
 		}
-		INIT_WORK(&sfw->work, flush_cpu_sheaves);
+		INIT_QPW(&sfw->qpw, flush_cpu_sheaves, cpu);
 		sfw->skip = false;
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (sfw->skip)
 			continue;
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}
 
 	mutex_unlock(&flush_lock);
@@ -3933,17 +3934,18 @@ static void flush_rcu_sheaf(struct work_
 	struct slab_sheaf *rcu_free;
 	struct slub_flush_work *sfw;
 	struct kmem_cache *s;
+	int cpu = qpw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;
 
-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
@@ -3968,14 +3970,14 @@ void flush_rcu_sheaves_on_cache(struct k
 		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
 		 */
 
-		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		INIT_QPW(&sfw->qpw, flush_rcu_sheaf, cpu);
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}
 
 	mutex_unlock(&flush_lock);
@@ -4472,22 +4474,24 @@ bool slab_post_alloc_hook(struct kmem_ca
  *
  * Must be called with the cpu_sheaves local lock locked. If successful, returns
  * the pcs pointer and the local lock locked (possibly on a different cpu than
- * initially called). If not successful, returns NULL and the local lock
- * unlocked.
+ * initially called), and migration disabled. If not successful, returns NULL
+ * and the local lock unlocked, with migration enabled.
  */
 static struct slub_percpu_sheaves *
-__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
+__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp,
+			 int *cpu)
 {
 	struct slab_sheaf *empty = NULL;
 	struct slab_sheaf *full;
 	struct node_barn *barn;
 	bool can_alloc;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}
 
@@ -4498,7 +4502,8 @@ __pcs_replace_empty_main(struct kmem_cac
 
 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}
 
@@ -4524,7 +4529,8 @@ __pcs_replace_empty_main(struct kmem_cac
 		}
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+	migrate_enable();
 
 	if (!can_alloc)
 		return NULL;
@@ -4550,7 +4556,9 @@ __pcs_replace_empty_main(struct kmem_cac
 	 * we can reach here only when gfpflags_allow_blocking
 	 * so this must not be an irq
 	 */
-	local_lock(&s->cpu_sheaves->lock);
+	migrate_disable();
+	*cpu = smp_processor_id();
+	qpw_lock(&s->cpu_sheaves->lock, *cpu);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	/*
@@ -4593,6 +4601,7 @@ void *alloc_from_pcs(struct kmem_cache *
 	struct slub_percpu_sheaves *pcs;
 	bool node_requested;
 	void *object;
+	int cpu;
 
 #ifdef CONFIG_NUMA
 	if (static_branch_unlikely(&strict_numa) &&
@@ -4627,13 +4636,17 @@ void *alloc_from_pcs(struct kmem_cache *
 		return NULL;
 	}
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		return NULL;
+	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == 0)) {
-		pcs = __pcs_replace_empty_main(s, pcs, gfp);
+		pcs = __pcs_replace_empty_main(s, pcs, gfp, &cpu);
 		if (unlikely(!pcs))
 			return NULL;
 	}
@@ -4647,7 +4660,8 @@ void *alloc_from_pcs(struct kmem_cache *
 		 * the current allocation or previous freeing process.
 		 */
 		if (page_to_nid(virt_to_page(object)) != node) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			stat(s, ALLOC_NODE_MISMATCH);
 			return NULL;
 		}
@@ -4655,7 +4669,8 @@ void *alloc_from_pcs(struct kmem_cache *
 
 	pcs->main->size--;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	stat(s, ALLOC_FASTPATH);
 
@@ -4670,10 +4685,15 @@ unsigned int alloc_from_pcs_bulk(struct
 	struct slab_sheaf *main;
 	unsigned int allocated = 0;
 	unsigned int batch;
+	int cpu;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		return allocated;
+	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
@@ -4683,7 +4703,8 @@ next_batch:
 		struct node_barn *barn;
 
 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			return allocated;
 		}
 
@@ -4694,7 +4715,8 @@ next_batch:
 
 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			return allocated;
 		}
 
@@ -4709,7 +4731,8 @@ next_batch:
 
 		stat(s, BARN_GET_FAIL);
 
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, cpu);
+		migrate_enable();
 
 		/*
 		 * Once full sheaves in barn are depleted, let the bulk
@@ -4727,7 +4750,8 @@ do_alloc:
 	main->size -= batch;
 	memcpy(p, main->objects + main->size, batch * sizeof(void *));
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	stat_add(s, ALLOC_FASTPATH, batch);
 
@@ -4877,6 +4901,7 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *sheaf = NULL;
 	struct node_barn *barn;
+	int cpu;
 
 	if (unlikely(!size))
 		return NULL;
@@ -4906,7 +4931,9 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 		return sheaf;
 	}
 
-	local_lock(&s->cpu_sheaves->lock);
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (pcs->spare) {
@@ -4925,7 +4952,8 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 			stat(s, BARN_GET_FAIL);
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 
 	if (!sheaf)
@@ -4961,6 +4989,7 @@ void kmem_cache_return_sheaf(struct kmem
 {
 	struct slub_percpu_sheaves *pcs;
 	struct node_barn *barn;
+	int cpu;
 
 	if (unlikely((sheaf->capacity != s->sheaf_capacity)
 		     || sheaf->pfmemalloc)) {
@@ -4969,7 +4998,9 @@ void kmem_cache_return_sheaf(struct kmem
 		return;
 	}
 
-	local_lock(&s->cpu_sheaves->lock);
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	barn = get_barn(s);
 
@@ -4979,7 +5010,8 @@ void kmem_cache_return_sheaf(struct kmem
 		stat(s, SHEAF_RETURN_FAST);
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	if (!sheaf)
 		return;
@@ -5507,9 +5539,9 @@ slab_empty:
  */
 static void __pcs_install_empty_sheaf(struct kmem_cache *s,
 		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
-		struct node_barn *barn)
+		struct node_barn *barn, int cpu)
 {
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* This is what we expect to find if nobody interrupted us. */
 	if (likely(!pcs->spare)) {
@@ -5546,31 +5578,34 @@ static void __pcs_install_empty_sheaf(st
 /*
  * Replace the full main sheaf with a (at least partially) empty sheaf.
  *
- * Must be called with the cpu_sheaves local lock locked. If successful, returns
- * the pcs pointer and the local lock locked (possibly on a different cpu than
- * initially called). If not successful, returns NULL and the local lock
- * unlocked.
+ * Must be called with the cpu_sheaves local lock locked, and migration counter
+ * increased. If successful, returns the pcs pointer and the local lock locked
+ * (possibly on a different cpu than initially called), with migration counter
+ * increased. If not successful, returns NULL and the local lock unlocked,
+ * and migration counter decreased.
  */
 static struct slub_percpu_sheaves *
 __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
-			bool allow_spin)
+			bool allow_spin, int *cpu)
 {
 	struct slab_sheaf *empty;
 	struct node_barn *barn;
 	bool put_fail;
 
 restart:
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}
 
 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}
 
@@ -5607,7 +5642,8 @@ restart:
 		stat(s, BARN_PUT_FAIL);
 
 		pcs->spare = NULL;
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 
 		sheaf_flush_unused(s, to_flush);
 		empty = to_flush;
@@ -5623,7 +5659,8 @@ restart:
 	put_fail = true;
 
 alloc_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+	migrate_enable();
 
 	/*
 	 * alloc_empty_sheaf() doesn't support !allow_spin and it's
@@ -5640,11 +5677,17 @@ alloc_empty:
 	if (put_fail)
 		 stat(s, BARN_PUT_FAIL);
 
-	if (!sheaf_flush_main(s))
+	migrate_disable();
+	*cpu = smp_processor_id();
+	if (!sheaf_flush_main(s, *cpu)) {
+		migrate_enable();
 		return NULL;
+	}
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!qpw_trylock(&s->cpu_sheaves->lock, *cpu)) {
+		migrate_enable();
 		return NULL;
+	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
@@ -5659,13 +5702,14 @@ alloc_empty:
 	return pcs;
 
 got_empty:
-	if (!local_trylock(&s->cpu_sheaves->lock)) {
+	if (!qpw_trylock(&s->cpu_sheaves->lock, *cpu)) {
+		migrate_enable();
 		barn_put_empty_sheaf(barn, empty);
 		return NULL;
 	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
-	__pcs_install_empty_sheaf(s, pcs, empty, barn);
+	__pcs_install_empty_sheaf(s, pcs, empty, barn, *cpu);
 
 	return pcs;
 }
@@ -5678,22 +5722,28 @@ static __fastpath_inline
 bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
 {
 	struct slub_percpu_sheaves *pcs;
+	int cpu;
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		return false;
+	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
 
-		pcs = __pcs_replace_full_main(s, pcs, allow_spin);
+		pcs = __pcs_replace_full_main(s, pcs, allow_spin, &cpu);
 		if (unlikely(!pcs))
 			return false;
 	}
 
 	pcs->main->objects[pcs->main->size++] = object;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	stat(s, FREE_FASTPATH);
 
@@ -5777,14 +5827,19 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
+	int cpu;
 
 	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
 		return false;
 
 	lock_map_acquire_try(&kfree_rcu_sheaf_map);
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		goto fail;
+	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
@@ -5795,7 +5850,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 
 		/* Bootstrap or debug cache, fall back */
 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			goto fail;
 		}
 
@@ -5807,7 +5863,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 
 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			goto fail;
 		}
 
@@ -5818,15 +5875,18 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 			goto do_free;
 		}
 
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, cpu);
+		migrate_enable();
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
 		if (!empty)
 			goto fail;
 
-		if (!local_trylock(&s->cpu_sheaves->lock)) {
+		migrate_disable();
+		if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
 			barn_put_empty_sheaf(barn, empty);
+			migrate_enable();
 			goto fail;
 		}
 
@@ -5862,7 +5922,8 @@ do_free:
 	if (rcu_sheaf)
 		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	stat(s, FREE_RCU_SHEAF);
 	lock_map_release(&kfree_rcu_sheaf_map);
@@ -5889,6 +5950,7 @@ static void free_to_pcs_bulk(struct kmem
 	void *remote_objects[PCS_BATCH_MAX];
 	unsigned int remote_nr = 0;
 	int node = numa_mem_id();
+	int cpu;
 
 next_remote_batch:
 	while (i < size) {
@@ -5918,7 +5980,9 @@ next_remote_batch:
 		goto flush_remote;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu))
 		goto fallback;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -5961,7 +6025,8 @@ do_free:
 	memcpy(main->objects + main->size, p, batch * sizeof(void *));
 	main->size += batch;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	stat_add(s, FREE_FASTPATH, batch);
 
@@ -5977,7 +6042,8 @@ do_free:
 	return;
 
 no_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();
 
 	/*
 	 * if we depleted all empty sheaves in the barn or there are too
@@ -7377,7 +7443,7 @@ static int init_percpu_sheaves(struct km
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		local_trylock_init(&pcs->lock);
+		qpw_trylock_init(&pcs->lock);
 
 		/*
 		 * Bootstrap sheaf has zero size so fast-path allocation fails.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/4] slub: apply new queue_percpu_work_on() interface
  2026-02-06 14:34 ` [PATCH 4/4] slub: " Marcelo Tosatti
@ 2026-02-07  1:27   ` Leonardo Bras
  0 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-07  1:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Fri, Feb 06, 2026 at 11:34:34AM -0300, Marcelo Tosatti wrote:
> Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
> interface to improve performance & latency on PREEMPT_RT kernels.
> 
> For functions that may be scheduled in a different cpu, replace
> local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
> queue_percpu_work_on(). The same happens for flush_work() and
> flush_percpu_work().
> 
> This change requires allocation of qpw_structs instead of a work_structs,
> and changing parameters of a few functions to include the cpu parameter.
> 
> This should bring no relevant performance impact on non-RT kernels:

Same as prev patch

> For functions that may be scheduled in a different cpu, the local_*lock's
> this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  mm/slub.c |  218 ++++++++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 142 insertions(+), 76 deletions(-)
> 
> Index: slab/mm/slub.c
> ===================================================================
> --- slab.orig/mm/slub.c
> +++ slab/mm/slub.c
> @@ -49,6 +49,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/kprobes.h>
>  #include <linux/debugfs.h>
> +#include <linux/qpw.h>
>  #include <trace/events/kmem.h>
>  
>  #include "internal.h"
> @@ -128,7 +129,7 @@
>   *   For debug caches, all allocations are forced to go through a list_lock
>   *   protected region to serialize against concurrent validation.
>   *
> - *   cpu_sheaves->lock (local_trylock)
> + *   cpu_sheaves->lock (qpw_trylock)
>   *
>   *   This lock protects fastpath operations on the percpu sheaves. On !RT it
>   *   only disables preemption and does no atomic operations. As long as the main
> @@ -156,7 +157,7 @@
>   *   Interrupts are disabled as part of list_lock or barn lock operations, or
>   *   around the slab_lock operation, in order to make the slab allocator safe
>   *   to use in the context of an irq.
> - *   Preemption is disabled as part of local_trylock operations.
> + *   Preemption is disabled as part of qpw_trylock operations.
>   *   kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
>   *   their limitations.
>   *
> @@ -417,7 +418,7 @@ struct slab_sheaf {
>  };
>  
>  struct slub_percpu_sheaves {
> -	local_trylock_t lock;
> +	qpw_trylock_t lock;
>  	struct slab_sheaf *main; /* never NULL when unlocked */
>  	struct slab_sheaf *spare; /* empty or full, may be NULL */
>  	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> @@ -479,7 +480,7 @@ static nodemask_t slab_nodes;
>  static struct workqueue_struct *flushwq;
>  
>  struct slub_flush_work {
> -	struct work_struct work;
> +	struct qpw_struct qpw;
>  	struct kmem_cache *s;
>  	bool skip;
>  };
> @@ -2826,7 +2827,7 @@ static void __kmem_cache_free_bulk(struc
>   *
>   * returns true if at least partially flushed
>   */
> -static bool sheaf_flush_main(struct kmem_cache *s)
> +static bool sheaf_flush_main(struct kmem_cache *s, int cpu)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	unsigned int batch, remaining;
> @@ -2835,10 +2836,10 @@ static bool sheaf_flush_main(struct kmem
>  	bool ret = false;
>  
>  next_batch:
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu))
>  		return ret;
>  
> -	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>  	sheaf = pcs->main;
>  
>  	batch = min(PCS_BATCH_MAX, sheaf->size);
> @@ -2848,7 +2849,7 @@ next_batch:
>  
>  	remaining = sheaf->size;
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
>  
>  	__kmem_cache_free_bulk(s, batch, &objects[0]);
>  
> @@ -2932,13 +2933,13 @@ static void rcu_free_sheaf_nobarn(struct
>   * flushing operations are rare so let's keep it simple and flush to slabs
>   * directly, skipping the barn
>   */
> -static void pcs_flush_all(struct kmem_cache *s)
> +static void pcs_flush_all(struct kmem_cache *s, int cpu)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *spare, *rcu_free;
>  
> -	local_lock(&s->cpu_sheaves->lock);
> -	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	qpw_lock(&s->cpu_sheaves->lock, cpu);
> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>  
>  	spare = pcs->spare;
>  	pcs->spare = NULL;
> @@ -2946,7 +2947,7 @@ static void pcs_flush_all(struct kmem_ca
>  	rcu_free = pcs->rcu_free;
>  	pcs->rcu_free = NULL;
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
>  
>  	if (spare) {
>  		sheaf_flush_unused(s, spare);
> @@ -2956,7 +2957,7 @@ static void pcs_flush_all(struct kmem_ca
>  	if (rcu_free)
>  		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
>  
> -	sheaf_flush_main(s);
> +	sheaf_flush_main(s, cpu);
>  }
>  
>  static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> @@ -3881,13 +3882,13 @@ static void flush_cpu_sheaves(struct wor
>  {
>  	struct kmem_cache *s;
>  	struct slub_flush_work *sfw;
> +	int cpu = qpw_get_cpu(w);
>  
> -	sfw = container_of(w, struct slub_flush_work, work);
> -
> +	sfw = &per_cpu(slub_flush, cpu);
>  	s = sfw->s;
>  
>  	if (cache_has_sheaves(s))
> -		pcs_flush_all(s);
> +		pcs_flush_all(s, cpu);
>  }
>  
>  static void flush_all_cpus_locked(struct kmem_cache *s)
> @@ -3904,17 +3905,17 @@ static void flush_all_cpus_locked(struct
>  			sfw->skip = true;
>  			continue;
>  		}
> -		INIT_WORK(&sfw->work, flush_cpu_sheaves);
> +		INIT_QPW(&sfw->qpw, flush_cpu_sheaves, cpu);
>  		sfw->skip = false;
>  		sfw->s = s;
> -		queue_work_on(cpu, flushwq, &sfw->work);
> +		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
>  	}
>  
>  	for_each_online_cpu(cpu) {
>  		sfw = &per_cpu(slub_flush, cpu);
>  		if (sfw->skip)
>  			continue;
> -		flush_work(&sfw->work);
> +		flush_percpu_work(&sfw->qpw);
>  	}
>  
>  	mutex_unlock(&flush_lock);
> @@ -3933,17 +3934,18 @@ static void flush_rcu_sheaf(struct work_
>  	struct slab_sheaf *rcu_free;
>  	struct slub_flush_work *sfw;
>  	struct kmem_cache *s;
> +	int cpu = qpw_get_cpu(w);
>  
> -	sfw = container_of(w, struct slub_flush_work, work);
> +	sfw = &per_cpu(slub_flush, cpu);
>  	s = sfw->s;
>  
> -	local_lock(&s->cpu_sheaves->lock);
> -	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	qpw_lock(&s->cpu_sheaves->lock, cpu);
> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>  
>  	rcu_free = pcs->rcu_free;
>  	pcs->rcu_free = NULL;
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
>  
>  	if (rcu_free)
>  		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> @@ -3968,14 +3970,14 @@ void flush_rcu_sheaves_on_cache(struct k
>  		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
>  		 */
>  
> -		INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +		INIT_QPW(&sfw->qpw, flush_rcu_sheaf, cpu);
>  		sfw->s = s;
> -		queue_work_on(cpu, flushwq, &sfw->work);
> +		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
>  	}
>  
>  	for_each_online_cpu(cpu) {
>  		sfw = &per_cpu(slub_flush, cpu);
> -		flush_work(&sfw->work);
> +		flush_percpu_work(&sfw->qpw);
>  	}
>  
>  	mutex_unlock(&flush_lock);
> @@ -4472,22 +4474,24 @@ bool slab_post_alloc_hook(struct kmem_ca
>   *
>   * Must be called with the cpu_sheaves local lock locked. If successful, returns
>   * the pcs pointer and the local lock locked (possibly on a different cpu than
> - * initially called). If not successful, returns NULL and the local lock
> - * unlocked.
> + * initially called), and migration disabled. If not successful, returns NULL
> + * and the local lock unlocked, with migration enabled.
>   */
>  static struct slub_percpu_sheaves *
> -__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
> +__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp,
> +			 int *cpu)
>  {
>  	struct slab_sheaf *empty = NULL;
>  	struct slab_sheaf *full;
>  	struct node_barn *barn;
>  	bool can_alloc;
>  
> -	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
>  
>  	/* Bootstrap or debug cache, back off */
>  	if (unlikely(!cache_has_sheaves(s))) {
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +		migrate_enable();
>  		return NULL;
>  	}
>  
> @@ -4498,7 +4502,8 @@ __pcs_replace_empty_main(struct kmem_cac
>  
>  	barn = get_barn(s);
>  	if (!barn) {
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +		migrate_enable();
>  		return NULL;
>  	}
>  
> @@ -4524,7 +4529,8 @@ __pcs_replace_empty_main(struct kmem_cac
>  		}
>  	}
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +	migrate_enable();
>  
>  	if (!can_alloc)
>  		return NULL;
> @@ -4550,7 +4556,9 @@ __pcs_replace_empty_main(struct kmem_cac
>  	 * we can reach here only when gfpflags_allow_blocking
>  	 * so this must not be an irq
>  	 */
> -	local_lock(&s->cpu_sheaves->lock);
> +	migrate_disable();
> +	*cpu = smp_processor_id();
> +	qpw_lock(&s->cpu_sheaves->lock, *cpu);
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	/*
> @@ -4593,6 +4601,7 @@ void *alloc_from_pcs(struct kmem_cache *
>  	struct slub_percpu_sheaves *pcs;
>  	bool node_requested;
>  	void *object;
> +	int cpu;
>  
>  #ifdef CONFIG_NUMA
>  	if (static_branch_unlikely(&strict_numa) &&
> @@ -4627,13 +4636,17 @@ void *alloc_from_pcs(struct kmem_cache *
>  		return NULL;
>  	}
>  
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
> +		migrate_enable();
>  		return NULL;
> +	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(pcs->main->size == 0)) {
> -		pcs = __pcs_replace_empty_main(s, pcs, gfp);
> +		pcs = __pcs_replace_empty_main(s, pcs, gfp, &cpu);
>  		if (unlikely(!pcs))
>  			return NULL;
>  	}
> @@ -4647,7 +4660,8 @@ void *alloc_from_pcs(struct kmem_cache *
>  		 * the current allocation or previous freeing process.
>  		 */
>  		if (page_to_nid(virt_to_page(object)) != node) {
> -			local_unlock(&s->cpu_sheaves->lock);
> +			qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +			migrate_enable();
>  			stat(s, ALLOC_NODE_MISMATCH);
>  			return NULL;
>  		}
> @@ -4655,7 +4669,8 @@ void *alloc_from_pcs(struct kmem_cache *
>  
>  	pcs->main->size--;
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	stat(s, ALLOC_FASTPATH);
>  
> @@ -4670,10 +4685,15 @@ unsigned int alloc_from_pcs_bulk(struct
>  	struct slab_sheaf *main;
>  	unsigned int allocated = 0;
>  	unsigned int batch;
> +	int cpu;
>  
>  next_batch:
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
> +		migrate_enable();
>  		return allocated;
> +	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
> @@ -4683,7 +4703,8 @@ next_batch:
>  		struct node_barn *barn;
>  
>  		if (unlikely(!cache_has_sheaves(s))) {
> -			local_unlock(&s->cpu_sheaves->lock);
> +			qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +			migrate_enable();
>  			return allocated;
>  		}
>  
> @@ -4694,7 +4715,8 @@ next_batch:
>  
>  		barn = get_barn(s);
>  		if (!barn) {
> -			local_unlock(&s->cpu_sheaves->lock);
> +			qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +			migrate_enable();
>  			return allocated;
>  		}
>  
> @@ -4709,7 +4731,8 @@ next_batch:
>  
>  		stat(s, BARN_GET_FAIL);
>  
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +		migrate_enable();
>  
>  		/*
>  		 * Once full sheaves in barn are depleted, let the bulk
> @@ -4727,7 +4750,8 @@ do_alloc:
>  	main->size -= batch;
>  	memcpy(p, main->objects + main->size, batch * sizeof(void *));
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	stat_add(s, ALLOC_FASTPATH, batch);
>  
> @@ -4877,6 +4901,7 @@ kmem_cache_prefill_sheaf(struct kmem_cac
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *sheaf = NULL;
>  	struct node_barn *barn;
> +	int cpu;
>  
>  	if (unlikely(!size))
>  		return NULL;
> @@ -4906,7 +4931,9 @@ kmem_cache_prefill_sheaf(struct kmem_cac
>  		return sheaf;
>  	}
>  
> -	local_lock(&s->cpu_sheaves->lock);
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	qpw_lock(&s->cpu_sheaves->lock, cpu);
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (pcs->spare) {
> @@ -4925,7 +4952,8 @@ kmem_cache_prefill_sheaf(struct kmem_cac
>  			stat(s, BARN_GET_FAIL);
>  	}
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  
>  	if (!sheaf)
> @@ -4961,6 +4989,7 @@ void kmem_cache_return_sheaf(struct kmem
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct node_barn *barn;
> +	int cpu;
>  
>  	if (unlikely((sheaf->capacity != s->sheaf_capacity)
>  		     || sheaf->pfmemalloc)) {
> @@ -4969,7 +4998,9 @@ void kmem_cache_return_sheaf(struct kmem
>  		return;
>  	}
>  
> -	local_lock(&s->cpu_sheaves->lock);
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	qpw_lock(&s->cpu_sheaves->lock, cpu);
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  	barn = get_barn(s);
>  
> @@ -4979,7 +5010,8 @@ void kmem_cache_return_sheaf(struct kmem
>  		stat(s, SHEAF_RETURN_FAST);
>  	}
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	if (!sheaf)
>  		return;
> @@ -5507,9 +5539,9 @@ slab_empty:
>   */
>  static void __pcs_install_empty_sheaf(struct kmem_cache *s,
>  		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
> -		struct node_barn *barn)
> +		struct node_barn *barn, int cpu)
>  {
> -	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
>  
>  	/* This is what we expect to find if nobody interrupted us. */
>  	if (likely(!pcs->spare)) {
> @@ -5546,31 +5578,34 @@ static void __pcs_install_empty_sheaf(st
>  /*
>   * Replace the full main sheaf with a (at least partially) empty sheaf.
>   *
> - * Must be called with the cpu_sheaves local lock locked. If successful, returns
> - * the pcs pointer and the local lock locked (possibly on a different cpu than
> - * initially called). If not successful, returns NULL and the local lock
> - * unlocked.
> + * Must be called with the cpu_sheaves local lock locked, and migration counter

					   ^~ qpw?	

> + * increased. If successful, returns the pcs pointer and the local lock locked
> + * (possibly on a different cpu than initially called), with migration counter
> + * increased. If not successful, returns NULL and the local lock unlocked,

					   		   ^~ qpw?

> + * and migration counter decreased.
>   */
>  static struct slub_percpu_sheaves *
>  __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> -			bool allow_spin)
> +			bool allow_spin, int *cpu)
>  {
>  	struct slab_sheaf *empty;
>  	struct node_barn *barn;
>  	bool put_fail;
>  
>  restart:
> -	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
>  
>  	/* Bootstrap or debug cache, back off */
>  	if (unlikely(!cache_has_sheaves(s))) {
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +		migrate_enable();
>  		return NULL;
>  	}
>  
>  	barn = get_barn(s);
>  	if (!barn) {
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +		migrate_enable();
>  		return NULL;
>  	}
>  
> @@ -5607,7 +5642,8 @@ restart:
>  		stat(s, BARN_PUT_FAIL);
>  
>  		pcs->spare = NULL;
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +		migrate_enable();
>  
>  		sheaf_flush_unused(s, to_flush);
>  		empty = to_flush;
> @@ -5623,7 +5659,8 @@ restart:
>  	put_fail = true;
>  
>  alloc_empty:
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, *cpu);
> +	migrate_enable();
>  
>  	/*
>  	 * alloc_empty_sheaf() doesn't support !allow_spin and it's
> @@ -5640,11 +5677,17 @@ alloc_empty:
>  	if (put_fail)
>  		 stat(s, BARN_PUT_FAIL);
>  
> -	if (!sheaf_flush_main(s))
> +	migrate_disable();
> +	*cpu = smp_processor_id();
> +	if (!sheaf_flush_main(s, *cpu)) {
> +		migrate_enable();
>  		return NULL;
> +	}
>  
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, *cpu)) {
> +		migrate_enable();
>  		return NULL;
> +	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
> @@ -5659,13 +5702,14 @@ alloc_empty:
>  	return pcs;
>  
>  got_empty:
> -	if (!local_trylock(&s->cpu_sheaves->lock)) {
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, *cpu)) {
> +		migrate_enable();
>  		barn_put_empty_sheaf(barn, empty);
>  		return NULL;
>  	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
> -	__pcs_install_empty_sheaf(s, pcs, empty, barn);
> +	__pcs_install_empty_sheaf(s, pcs, empty, barn, *cpu);
>  
>  	return pcs;
>  }
> @@ -5678,22 +5722,28 @@ static __fastpath_inline
>  bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
>  {
>  	struct slub_percpu_sheaves *pcs;
> +	int cpu;
>  
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
> +		migrate_enable();
>  		return false;
> +	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>  
> -		pcs = __pcs_replace_full_main(s, pcs, allow_spin);
> +		pcs = __pcs_replace_full_main(s, pcs, allow_spin, &cpu);
>  		if (unlikely(!pcs))
>  			return false;
>  	}
>  
>  	pcs->main->objects[pcs->main->size++] = object;
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	stat(s, FREE_FASTPATH);
>  
> @@ -5777,14 +5827,19 @@ bool __kfree_rcu_sheaf(struct kmem_cache
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *rcu_sheaf;
> +	int cpu;
>  
>  	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
>  		return false;
>  
>  	lock_map_acquire_try(&kfree_rcu_sheaf_map);
>  
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
> +		migrate_enable();
>  		goto fail;
> +	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
> @@ -5795,7 +5850,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache
>  
>  		/* Bootstrap or debug cache, fall back */
>  		if (unlikely(!cache_has_sheaves(s))) {
> -			local_unlock(&s->cpu_sheaves->lock);
> +			qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +			migrate_enable();
>  			goto fail;
>  		}
>  
> @@ -5807,7 +5863,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache
>  
>  		barn = get_barn(s);
>  		if (!barn) {
> -			local_unlock(&s->cpu_sheaves->lock);
> +			qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +			migrate_enable();
>  			goto fail;
>  		}
>  
> @@ -5818,15 +5875,18 @@ bool __kfree_rcu_sheaf(struct kmem_cache
>  			goto do_free;
>  		}
>  
> -		local_unlock(&s->cpu_sheaves->lock);
> +		qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +		migrate_enable();
>  
>  		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>  
>  		if (!empty)
>  			goto fail;
>  
> -		if (!local_trylock(&s->cpu_sheaves->lock)) {
> +		migrate_disable();
> +		if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
>  			barn_put_empty_sheaf(barn, empty);
> +			migrate_enable();
>  			goto fail;
>  		}
>  
> @@ -5862,7 +5922,8 @@ do_free:
>  	if (rcu_sheaf)
>  		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	stat(s, FREE_RCU_SHEAF);
>  	lock_map_release(&kfree_rcu_sheaf_map);
> @@ -5889,6 +5950,7 @@ static void free_to_pcs_bulk(struct kmem
>  	void *remote_objects[PCS_BATCH_MAX];
>  	unsigned int remote_nr = 0;
>  	int node = numa_mem_id();
> +	int cpu;
>  
>  next_remote_batch:
>  	while (i < size) {
> @@ -5918,7 +5980,9 @@ next_remote_batch:
>  		goto flush_remote;
>  
>  next_batch:
> -	if (!local_trylock(&s->cpu_sheaves->lock))
> +	migrate_disable();
> +	cpu = smp_processor_id();
> +	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu))
>  		goto fallback;
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
> @@ -5961,7 +6025,8 @@ do_free:
>  	memcpy(main->objects + main->size, p, batch * sizeof(void *));
>  	main->size += batch;
>  
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	stat_add(s, FREE_FASTPATH, batch);
>  
> @@ -5977,7 +6042,8 @@ do_free:
>  	return;
>  
>  no_empty:
> -	local_unlock(&s->cpu_sheaves->lock);
> +	qpw_unlock(&s->cpu_sheaves->lock, cpu);
> +	migrate_enable();
>  
>  	/*
>  	 * if we depleted all empty sheaves in the barn or there are too
> @@ -7377,7 +7443,7 @@ static int init_percpu_sheaves(struct km
>  
>  		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>  
> -		local_trylock_init(&pcs->lock);
> +		qpw_trylock_init(&pcs->lock);
>  
>  		/*
>  		 * Bootstrap sheaf has zero size so fast-path allocation fails.
> 
> 


Conversions look correct.

I have some ideas, but I am still not sure about the need of 
migrate_*able() here, but if they are indeed needed, I think we should work 
on having them inside helpers that are special for local_cpu-only 
functions, instead of happening on user code like this.

What do you think?

Thanks for getting this upstream!
Leo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2026-02-06 14:34 ` [PATCH 4/4] slub: " Marcelo Tosatti
@ 2026-02-06 23:56 ` Leonardo Bras
  2026-02-10 14:01 ` Michal Hocko
  5 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-06 23:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Fri, Feb 06, 2026 at 11:34:30AM -0300, Marcelo Tosatti wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
> 
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 

Marcelo, thanks for finishing this series!

> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If PREEMPT_RT=n this interfaces just wraps the current

Are we enabling it by default in PREEMPT_RT=y? If not,

If CONFIG_QPW=n or qpw=0 this interfaces just wraps the current

> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If PREEMPT_RT=y, or CONFIG_QPW=y, queue_percpu_work_on(cpu,...) will

Same here

If CONFIG_QPW=y and qpw=1, queue_percpu_work_on(cpu,...) will

> lock that cpu's per-cpu structure and perform work on it locally. 
> This is possible because on functions that can be used for performing
> remote work on remote per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> RFC->v1:
> 
> - Introduce CONFIG_QPW and qpw= kernel boot option to enable 
>   remote spinlocking and execution even on !CONFIG_PREEMPT_RT
>   kernels (Leonardo Bras).
> - Move buffer_head draining to separate workqueue (Marcelo Tosatti).
> - Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
> - Drop memcontrol convertion (as isolated CPUs are not targets
>   of queue_work_on anymore).
> - Rebase SLUB against Vlastimil's slab/next.
> - Add basic document for QPW (Waiman Long).

A document was a nice touch :)

> 
> 
> The following testcase triggers lru_add_drain_all on an isolated CPU
> (that does sys_write to a file before entering its realtime 
> loop).
> 
> /* 
>  * Simulates a low latency loop program that is interrupted
>  * due to lru_add_drain_all. To trigger lru_add_drain_all, run:
>  *
>  * blockdev --flushbufs /dev/sdX
>  *
>  */ 
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <stdlib.h>
> #include <stdarg.h>
> #include <pthread.h>
> #include <sched.h>
> #include <unistd.h>
> 
> int cpu;
> 
> static void *run(void *arg)
> {
> 	pthread_t current_thread;
> 	cpu_set_t cpuset;
> 	int ret, nrloops;
> 	struct sched_param sched_p;
> 	pid_t pid;
> 	int fd;
> 	char buf[] = "xxxxxxxxxxx";
> 
> 	CPU_ZERO(&cpuset);
> 	CPU_SET(cpu, &cpuset);
> 
> 	current_thread = pthread_self();    
> 	ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
> 	if (ret) {
> 		perror("pthread_setaffinity_np failed\n");
> 		exit(0);
> 	}
> 
> 	memset(&sched_p, 0, sizeof(struct sched_param));
> 	sched_p.sched_priority = 1;
> 	pid = gettid();
> 	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
> 	if (ret) {
> 		perror("sched_setscheduler");
> 		exit(0);
> 	}
> 
> 	fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC);
> 	if (fd == -1) {
> 		perror("open");
> 		exit(0);
> 	}
> 
> 	ret = write(fd, buf, sizeof(buf));
> 	if (ret == -1) {
> 		perror("write");
> 		exit(0);
> 	}
> 
> 	do { 
> 		nrloops = nrloops+2;
> 		nrloops--;
> 	} while (1);
> }
> 
> int main(int argc, char *argv[])
> {
>         int fd, ret;
> 	pthread_t thread;
> 	long val;
> 	char *endptr, *str;
> 	struct sched_param sched_p;
> 	pid_t pid;
> 
> 	if (argc != 2) {
> 		printf("usage: %s cpu-nr\n", argv[0]);
> 		printf("where CPU number is the CPU to pin thread to\n");
> 		exit(0);
> 	}
> 	str = argv[1];
> 	cpu = strtol(str, &endptr, 10);
> 	if (cpu < 0) {
> 		printf("strtol returns %d\n", cpu);
> 		exit(0);
> 	}
> 	printf("cpunr=%d\n", cpu);
> 
> 	memset(&sched_p, 0, sizeof(struct sched_param));
> 	sched_p.sched_priority = 1;
> 	pid = getpid();
> 	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
> 	if (ret) {
> 		perror("sched_setscheduler");
> 		exit(0);
> 	}
> 
> 	pthread_create(&thread, NULL, run, NULL);
> 
> 	sleep(5000);
> 
> 	pthread_join(thread, NULL);
> }
> 
> 

Also, having the reproducer in the cover letter was a great idea!

Thanks!
Leo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2026-02-06 23:56 ` [PATCH 0/4] Introduce QPW for per-cpu operations Leonardo Bras
@ 2026-02-10 14:01 ` Michal Hocko
  2026-02-11 12:01   ` Marcelo Tosatti
  5 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2026-02-10 14:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On Fri 06-02-26 11:34:30, Marcelo Tosatti wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
> 
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If PREEMPT_RT=n this interfaces just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If PREEMPT_RT=y, or CONFIG_QPW=y, queue_percpu_work_on(cpu,...) will
> lock that cpu's per-cpu structure and perform work on it locally. 
> This is possible because on functions that can be used for performing
> remote work on remote per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

What about !PREEMPT_RT? We have people running isolated workloads and
these sorts of pcp disruptions are really unwelcome as well. They do not
have requirements as strong as RT workloads but the underlying
fundamental problem is the same. Frederic (now CCed) is working on
moving those pcp book keeping activities to be executed to the return to
the userspace which should be taking care of both RT and non-RT
configurations AFAICS.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-10 14:01 ` Michal Hocko
@ 2026-02-11 12:01   ` Marcelo Tosatti
  2026-02-11 12:11     ` Marcelo Tosatti
  2026-02-11 16:38     ` Michal Hocko
  0 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-11 12:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> On Fri 06-02-26 11:34:30, Marcelo Tosatti wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. That major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> > 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > RT workload.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case.
> > 
> > If PREEMPT_RT=n this interfaces just wraps the current
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If PREEMPT_RT=y, or CONFIG_QPW=y, queue_percpu_work_on(cpu,...) will
> > lock that cpu's per-cpu structure and perform work on it locally. 
> > This is possible because on functions that can be used for performing
> > remote work on remote per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> What about !PREEMPT_RT? We have people running isolated workloads and
> these sorts of pcp disruptions are really unwelcome as well. They do not
> have requirements as strong as RT workloads but the underlying
> fundamental problem is the same. Frederic (now CCed) is working on
> moving those pcp book keeping activities to be executed to the return to
> the userspace which should be taking care of both RT and non-RT
> configurations AFAICS.

Michal,

For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
boot option qpw=y/n, which controls whether the behaviour will be
similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).

If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
(and remote work via work_queue) is used.

What "pcp book keeping activities" you refer to ? I don't see how
moving certain activities that happen under SLUB or LRU spinlocks
to happen before return to userspace changes things related 
to avoidance of CPU interruption ?

Thanks



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 12:01   ` Marcelo Tosatti
@ 2026-02-11 12:11     ` Marcelo Tosatti
  2026-02-14 21:35       ` Leonardo Bras
  2026-02-11 16:38     ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-11 12:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On Wed, Feb 11, 2026 at 09:01:12AM -0300, Marcelo Tosatti wrote:
> On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > On Fri 06-02-26 11:34:30, Marcelo Tosatti wrote:
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > kernels, even though the very few remote operations will be expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > > 
> > > The idea:
> > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > In this case, instead of scheduling work on a remote cpu, it should
> > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > work locally. That major cost, which is un/locking in every local function,
> > > already happens in PREEMPT_RT.
> > > 
> > > Also, there is no need to worry about extra cache bouncing:
> > > The cacheline invalidation already happens due to schedule_work_on().
> > > 
> > > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > > RT workload.
> > > 
> > > Proposed solution:
> > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > Work Queue in the above mentioned use case.
> > > 
> > > If PREEMPT_RT=n this interfaces just wraps the current
> > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > 
> > > If PREEMPT_RT=y, or CONFIG_QPW=y, queue_percpu_work_on(cpu,...) will
> > > lock that cpu's per-cpu structure and perform work on it locally. 
> > > This is possible because on functions that can be used for performing
> > > remote work on remote per-cpu structures, the local_lock (which is already
> > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > 
> > What about !PREEMPT_RT? We have people running isolated workloads and
> > these sorts of pcp disruptions are really unwelcome as well. They do not
> > have requirements as strong as RT workloads but the underlying
> > fundamental problem is the same. Frederic (now CCed) is working on
> > moving those pcp book keeping activities to be executed to the return to
> > the userspace which should be taking care of both RT and non-RT
> > configurations AFAICS.
> 
> Michal,
> 
> For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> boot option qpw=y/n, which controls whether the behaviour will be
> similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> 
> If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> (and remote work via work_queue) is used.

OK, this is not true. There is only CONFIG_QPW and the qpw=yes/no kernel 
boot option for control.

CONFIG_PREEMPT_RT should probably select CONFIG_QPW=y and
CONFIG_QPW_DEFAULT=y.

> What "pcp book keeping activities" you refer to ? I don't see how
> moving certain activities that happen under SLUB or LRU spinlocks
> to happen before return to userspace changes things related 
> to avoidance of CPU interruption ?
> 
> Thanks
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 12:11     ` Marcelo Tosatti
@ 2026-02-14 21:35       ` Leonardo Bras
  0 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-14 21:35 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, Michal Hocko, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Wed, Feb 11, 2026 at 09:11:21AM -0300, Marcelo Tosatti wrote:
> On Wed, Feb 11, 2026 at 09:01:12AM -0300, Marcelo Tosatti wrote:
> > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > On Fri 06-02-26 11:34:30, Marcelo Tosatti wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting on local_locks() for most of the work, and some rare remote
> > > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > > kernels, even though the very few remote operations will be expensive due
> > > > to scheduling overhead.
> > > > 
> > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > an important workload scheduled out to deal with remote requests is
> > > > sure to introduce unexpected deadline misses.
> > > > 
> > > > The idea:
> > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > > In this case, instead of scheduling work on a remote cpu, it should
> > > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > > work locally. That major cost, which is un/locking in every local function,
> > > > already happens in PREEMPT_RT.
> > > > 
> > > > Also, there is no need to worry about extra cache bouncing:
> > > > The cacheline invalidation already happens due to schedule_work_on().
> > > > 
> > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > > > RT workload.
> > > > 
> > > > Proposed solution:
> > > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > > Work Queue in the above mentioned use case.
> > > > 
> > > > If PREEMPT_RT=n this interfaces just wraps the current
> > > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > > 
> > > > If PREEMPT_RT=y, or CONFIG_QPW=y, queue_percpu_work_on(cpu,...) will
> > > > lock that cpu's per-cpu structure and perform work on it locally. 
> > > > This is possible because on functions that can be used for performing
> > > > remote work on remote per-cpu structures, the local_lock (which is already
> > > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > > 
> > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > have requirements as strong as RT workloads but the underlying
> > > fundamental problem is the same. Frederic (now CCed) is working on
> > > moving those pcp book keeping activities to be executed to the return to
> > > the userspace which should be taking care of both RT and non-RT
> > > configurations AFAICS.
> > 
> > Michal,
> > 
> > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > boot option qpw=y/n, which controls whether the behaviour will be
> > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > 
> > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > (and remote work via work_queue) is used.
> 
> OK, this is not true. There is only CONFIG_QPW and the qpw=yes/no kernel 
> boot option for control.
> 
> CONFIG_PREEMPT_RT should probably select CONFIG_QPW=y and
> CONFIG_QPW_DEFAULT=y.

Fully agree :)

> 
> > What "pcp book keeping activities" you refer to ? I don't see how
> > moving certain activities that happen under SLUB or LRU spinlocks
> > to happen before return to userspace changes things related 
> > to avoidance of CPU interruption ?
> > 
> > Thanks
> > 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 12:01   ` Marcelo Tosatti
  2026-02-11 12:11     ` Marcelo Tosatti
@ 2026-02-11 16:38     ` Michal Hocko
  2026-02-11 16:50       ` Marcelo Tosatti
                         ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2026-02-11 16:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
[...]
> > What about !PREEMPT_RT? We have people running isolated workloads and
> > these sorts of pcp disruptions are really unwelcome as well. They do not
> > have requirements as strong as RT workloads but the underlying
> > fundamental problem is the same. Frederic (now CCed) is working on
> > moving those pcp book keeping activities to be executed to the return to
> > the userspace which should be taking care of both RT and non-RT
> > configurations AFAICS.
> 
> Michal,
> 
> For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> boot option qpw=y/n, which controls whether the behaviour will be
> similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).

My bad. I've misread the config space of this.

> If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> (and remote work via work_queue) is used.
> 
> What "pcp book keeping activities" you refer to ? I don't see how
> moving certain activities that happen under SLUB or LRU spinlocks
> to happen before return to userspace changes things related 
> to avoidance of CPU interruption ?

Essentially delayed operations like pcp state flushing happens on return
to the userspace on isolated CPUs. No locking changes are required as
the work is still per-cpu.

In other words the approach Frederic is working on is to not change the
locking of pcp delayed work but instead move that work into well defined
place - i.e. return to the userspace.

Btw. have you measure the impact of preempt_disbale -> spinlock on hot
paths like SLUB sheeves?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 16:38     ` Michal Hocko
@ 2026-02-11 16:50       ` Marcelo Tosatti
  2026-02-11 16:59         ` Vlastimil Babka
  2026-02-11 17:07         ` Michal Hocko
  2026-02-14 22:02       ` Leonardo Bras
  2026-02-19 13:15       ` Marcelo Tosatti
  2 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-11 16:50 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> [...]
> > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > have requirements as strong as RT workloads but the underlying
> > > fundamental problem is the same. Frederic (now CCed) is working on
> > > moving those pcp book keeping activities to be executed to the return to
> > > the userspace which should be taking care of both RT and non-RT
> > > configurations AFAICS.
> > 
> > Michal,
> > 
> > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > boot option qpw=y/n, which controls whether the behaviour will be
> > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> 
> My bad. I've misread the config space of this.

My bad, actually. Its only CONFIG_QPW on the current patchset.

> > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > (and remote work via work_queue) is used.
> > 
> > What "pcp book keeping activities" you refer to ? I don't see how
> > moving certain activities that happen under SLUB or LRU spinlocks
> > to happen before return to userspace changes things related 
> > to avoidance of CPU interruption ?
> 
> Essentially delayed operations like pcp state flushing happens on return
> to the userspace on isolated CPUs. No locking changes are required as
> the work is still per-cpu.
> 
> In other words the approach Frederic is working on is to not change the
> locking of pcp delayed work but instead move that work into well defined
> place - i.e. return to the userspace.
> 
> Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> paths like SLUB sheeves?

Nope, i have not. What is/are the standard benchmarks for SLUB/SLAB
allocation ?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 16:50       ` Marcelo Tosatti
@ 2026-02-11 16:59         ` Vlastimil Babka
  2026-02-11 17:07         ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2026-02-11 16:59 UTC (permalink / raw)
  To: Marcelo Tosatti, Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker

On 2/11/26 17:50, Marcelo Tosatti wrote:
> On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
>> On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
>> > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
>> [...]
>> > > What about !PREEMPT_RT? We have people running isolated workloads and
>> > > these sorts of pcp disruptions are really unwelcome as well. They do not
>> > > have requirements as strong as RT workloads but the underlying
>> > > fundamental problem is the same. Frederic (now CCed) is working on
>> > > moving those pcp book keeping activities to be executed to the return to
>> > > the userspace which should be taking care of both RT and non-RT
>> > > configurations AFAICS.
>> > 
>> > Michal,
>> > 
>> > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
>> > boot option qpw=y/n, which controls whether the behaviour will be
>> > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
>> 
>> My bad. I've misread the config space of this.
> 
> My bad, actually. Its only CONFIG_QPW on the current patchset.
> 
>> > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
>> > (and remote work via work_queue) is used.
>> > 
>> > What "pcp book keeping activities" you refer to ? I don't see how
>> > moving certain activities that happen under SLUB or LRU spinlocks
>> > to happen before return to userspace changes things related 
>> > to avoidance of CPU interruption ?
>> 
>> Essentially delayed operations like pcp state flushing happens on return
>> to the userspace on isolated CPUs. No locking changes are required as
>> the work is still per-cpu.
>> 
>> In other words the approach Frederic is working on is to not change the
>> locking of pcp delayed work but instead move that work into well defined
>> place - i.e. return to the userspace.
>> 
>> Btw. have you measure the impact of preempt_disbale -> spinlock on hot
>> paths like SLUB sheeves?
> 
> Nope, i have not. What is/are the standard benchmarks for SLUB/SLAB
> allocation ?

Those mentioned here, and I would say also netperf.
https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 16:50       ` Marcelo Tosatti
  2026-02-11 16:59         ` Vlastimil Babka
@ 2026-02-11 17:07         ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2026-02-11 17:07 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On Wed 11-02-26 13:50:45, Marcelo Tosatti wrote:
> On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > [...]
> > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > have requirements as strong as RT workloads but the underlying
> > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > moving those pcp book keeping activities to be executed to the return to
> > > > the userspace which should be taking care of both RT and non-RT
> > > > configurations AFAICS.
> > > 
> > > Michal,
> > > 
> > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > boot option qpw=y/n, which controls whether the behaviour will be
> > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > 
> > My bad. I've misread the config space of this.
> 
> My bad, actually. Its only CONFIG_QPW on the current patchset.

Yeah. PREEMPT_RT -> CONFIG_QPW=y and cmd line makes no difference
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 16:38     ` Michal Hocko
  2026-02-11 16:50       ` Marcelo Tosatti
@ 2026-02-14 22:02       ` Leonardo Bras
  2026-02-16 11:00         ` Michal Hocko
  2026-02-19 13:15       ` Marcelo Tosatti
  2 siblings, 1 reply; 35+ messages in thread
From: Leonardo Bras @ 2026-02-14 22:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> [...]
> > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > have requirements as strong as RT workloads but the underlying
> > > fundamental problem is the same. Frederic (now CCed) is working on
> > > moving those pcp book keeping activities to be executed to the return to
> > > the userspace which should be taking care of both RT and non-RT
> > > configurations AFAICS.
> > 
> > Michal,
> > 
> > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > boot option qpw=y/n, which controls whether the behaviour will be
> > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> 
> My bad. I've misread the config space of this.
> 
> > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > (and remote work via work_queue) is used.
> > 
> > What "pcp book keeping activities" you refer to ? I don't see how
> > moving certain activities that happen under SLUB or LRU spinlocks
> > to happen before return to userspace changes things related 
> > to avoidance of CPU interruption ?
> 
> Essentially delayed operations like pcp state flushing happens on return
> to the userspace on isolated CPUs. No locking changes are required as
> the work is still per-cpu.
> 
> In other words the approach Frederic is working on is to not change the
> locking of pcp delayed work but instead move that work into well defined
> place - i.e. return to the userspace.
> 
> Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> paths like SLUB sheeves?

Hi Michal,

I have done some study on this (which I presented on Plumbers 2023):
https://lpc.events/event/17/contributions/1484/ 

Since they are per-cpu spinlocks, and the remote operations are not that 
frequent, as per design of the current approach, we are not supposed to see 
contention (I was not able to detect contention even after stress testing 
for weeks), nor relevant cacheline bouncing.

That being said, for RT local_locks already get per-cpu spinlocks, so there 
is only difference for !RT, which as you mention, does preemtp_disable():

The performance impact noticed was mostly about jumping around in 
executable code, as inlining spinlocks (test #2 on presentation) took care 
of most of the added extra cycles, adding about 4-14 extra cycles per 
lock/unlock cycle. (tested on memcg with kmalloc test)

Yeah, as expected there is some extra cycles, as we are doing extra atomic 
operations (even if in a local cacheline) in !RT case, but this could be 
enabled only if the user thinks this is an ok cost for reducing 
interruptions.

What do you think?

Thanks!
Leo 










^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-14 22:02       ` Leonardo Bras
@ 2026-02-16 11:00         ` Michal Hocko
  2026-02-19 15:27           ` Marcelo Tosatti
                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2026-02-16 11:00 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marcelo Tosatti, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > [...]
> > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > have requirements as strong as RT workloads but the underlying
> > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > moving those pcp book keeping activities to be executed to the return to
> > > > the userspace which should be taking care of both RT and non-RT
> > > > configurations AFAICS.
> > > 
> > > Michal,
> > > 
> > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > boot option qpw=y/n, which controls whether the behaviour will be
> > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > 
> > My bad. I've misread the config space of this.
> > 
> > > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > > (and remote work via work_queue) is used.
> > > 
> > > What "pcp book keeping activities" you refer to ? I don't see how
> > > moving certain activities that happen under SLUB or LRU spinlocks
> > > to happen before return to userspace changes things related 
> > > to avoidance of CPU interruption ?
> > 
> > Essentially delayed operations like pcp state flushing happens on return
> > to the userspace on isolated CPUs. No locking changes are required as
> > the work is still per-cpu.
> > 
> > In other words the approach Frederic is working on is to not change the
> > locking of pcp delayed work but instead move that work into well defined
> > place - i.e. return to the userspace.
> > 
> > Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> > paths like SLUB sheeves?
> 
> Hi Michal,
> 
> I have done some study on this (which I presented on Plumbers 2023):
> https://lpc.events/event/17/contributions/1484/ 
> 
> Since they are per-cpu spinlocks, and the remote operations are not that 
> frequent, as per design of the current approach, we are not supposed to see 
> contention (I was not able to detect contention even after stress testing 
> for weeks), nor relevant cacheline bouncing.
> 
> That being said, for RT local_locks already get per-cpu spinlocks, so there 
> is only difference for !RT, which as you mention, does preemtp_disable():
> 
> The performance impact noticed was mostly about jumping around in 
> executable code, as inlining spinlocks (test #2 on presentation) took care 
> of most of the added extra cycles, adding about 4-14 extra cycles per 
> lock/unlock cycle. (tested on memcg with kmalloc test)
> 
> Yeah, as expected there is some extra cycles, as we are doing extra atomic 
> operations (even if in a local cacheline) in !RT case, but this could be 
> enabled only if the user thinks this is an ok cost for reducing 
> interruptions.
> 
> What do you think?

The fact that the behavior is opt-in for !RT is certainly a plus. I also
do not expect the overhead to be really be really big. To me, a much
more important question is which of the two approaches is easier to
maintain long term. The pcp work needs to be done one way or the other.
Whether we want to tweak locking or do it at a very well defined time is
the bigger question.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-16 11:00         ` Michal Hocko
@ 2026-02-19 15:27           ` Marcelo Tosatti
  2026-02-19 19:30             ` Michal Hocko
  2026-02-20 10:48             ` Vlastimil Babka
  2026-02-20 16:51           ` Marcelo Tosatti
  2026-02-20 21:58           ` Leonardo Bras
  2 siblings, 2 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-19 15:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> > On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > > have requirements as strong as RT workloads but the underlying
> > > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > > moving those pcp book keeping activities to be executed to the return to
> > > > > the userspace which should be taking care of both RT and non-RT
> > > > > configurations AFAICS.
> > > > 
> > > > Michal,
> > > > 
> > > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > > boot option qpw=y/n, which controls whether the behaviour will be
> > > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > > 
> > > My bad. I've misread the config space of this.
> > > 
> > > > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > > > (and remote work via work_queue) is used.
> > > > 
> > > > What "pcp book keeping activities" you refer to ? I don't see how
> > > > moving certain activities that happen under SLUB or LRU spinlocks
> > > > to happen before return to userspace changes things related 
> > > > to avoidance of CPU interruption ?
> > > 
> > > Essentially delayed operations like pcp state flushing happens on return
> > > to the userspace on isolated CPUs. No locking changes are required as
> > > the work is still per-cpu.
> > > 
> > > In other words the approach Frederic is working on is to not change the
> > > locking of pcp delayed work but instead move that work into well defined
> > > place - i.e. return to the userspace.
> > > 
> > > Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> > > paths like SLUB sheeves?
> > 
> > Hi Michal,
> > 
> > I have done some study on this (which I presented on Plumbers 2023):
> > https://lpc.events/event/17/contributions/1484/ 
> > 
> > Since they are per-cpu spinlocks, and the remote operations are not that 
> > frequent, as per design of the current approach, we are not supposed to see 
> > contention (I was not able to detect contention even after stress testing 
> > for weeks), nor relevant cacheline bouncing.
> > 
> > That being said, for RT local_locks already get per-cpu spinlocks, so there 
> > is only difference for !RT, which as you mention, does preemtp_disable():
> > 
> > The performance impact noticed was mostly about jumping around in 
> > executable code, as inlining spinlocks (test #2 on presentation) took care 
> > of most of the added extra cycles, adding about 4-14 extra cycles per 
> > lock/unlock cycle. (tested on memcg with kmalloc test)
> > 
> > Yeah, as expected there is some extra cycles, as we are doing extra atomic 
> > operations (even if in a local cacheline) in !RT case, but this could be 
> > enabled only if the user thinks this is an ok cost for reducing 
> > interruptions.
> > 
> > What do you think?
> 
> The fact that the behavior is opt-in for !RT is certainly a plus. I also
> do not expect the overhead to be really be really big. To me, a much
> more important question is which of the two approaches is easier to
> maintain long term. The pcp work needs to be done one way or the other.
> Whether we want to tweak locking or do it at a very well defined time is
> the bigger question.
> -- 
> Michal Hocko
> SUSE Labs

Michal,

Again, i don't see how moving operations to happen at return to 
kernel would help (assuming you are talking about 
"context_tracking,x86: Defer some IPIs until a user->kernel transition").

The IPIs in the patchset above can be deferred until user->kernel
transition because they are TLB flushes, for addresses which do not
exist on the address space mapping in userspace.

What are the per-CPU objects in SLUB ?

struct slab_sheaf {
        union {
                struct rcu_head rcu_head;
                struct list_head barn_list;
                /* only used for prefilled sheafs */
                struct {
                        unsigned int capacity;
                        bool pfmemalloc;
                };
        };
        struct kmem_cache *cache;
        unsigned int size;
        int node; /* only used for rcu_sheaf */
        void *objects[];
};

struct slub_percpu_sheaves {
        local_trylock_t lock;
        struct slab_sheaf *main; /* never NULL when unlocked */
        struct slab_sheaf *spare; /* empty or full, may be NULL */
        struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
};

Examples of local CPU operation that manipulates the data structures:
1) kmalloc, allocates an object from local per CPU list.
2) kfree, returns an object to local per CPU list.

Examples of an operation that would perform changes on the per-CPU lists 
remotely:
kmem_cache_shrink (cache shutdown), kmem_cache_shrink.

You can't delay either kmalloc (removal of object from per-CPU freelist), 
or kfree (return of object from per-CPU freelist), or kmem_cache_shrink 
or kmem_cache_shrink to return to userspace.

What i missing something here? (or do you have something on your mind
which i can't see).



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-19 15:27           ` Marcelo Tosatti
@ 2026-02-19 19:30             ` Michal Hocko
  2026-02-20 14:30               ` Marcelo Tosatti
  2026-02-20 10:48             ` Vlastimil Babka
  1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2026-02-19 19:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> Michal,
> 
> Again, i don't see how moving operations to happen at return to 
> kernel would help (assuming you are talking about 
> "context_tracking,x86: Defer some IPIs until a user->kernel transition").

Nope, I am not talking about IPIs, although those are an example of pcp
state as well. I am sorry I do not have a link handy, I am pretty sure
Frederic will have that. Another example, though, was vmstat flushes
that need to be pcp. There are many other examples. 

[...]

> You can't delay either kmalloc (removal of object from per-CPU freelist), 
> or kfree (return of object from per-CPU freelist), or kmem_cache_shrink 
> or kmem_cache_shrink to return to userspace.

Why?

> What i missing something here? (or do you have something on your mind
> which i can't see).

I am really sorry for being really vague here. Let me try to draw
a more abstract problem definition and let's see whether we are trying
to solve the same problem here. Maybe not...

I believe the main usecase of the interest here is uninterrupted
userspace execution and delayed pcp work that migh disturb such workload
after it has returned to the userspace. Right?
That is usually hauskeeping work that for, performance reasons, doesn't
happen in hot paths while the workload was executing in the kernel
space.

There are more ways to deal with that. You can either change the hot
path to not require deferred operation (tricky withtout introducing
regressions for most workloads) or you can define a more suitable place
to perform the housekeeping while still running in the kernel. 

Your QWP work relies on local_lock -> spin_lock transition and
performing the pcp work remotely so you do not need to disturb that
remote cpu. Correct?

Alternative approach is to define a moment when the housekeeping
operation is performed on that local cpu while still running in the
kernel space - e.g. when returning to the userspace. Delayed work is
then not necessary and userspace is not disrupted after returning to the
userspace.

Do I make more sense or does the above sound like a complete gibberish?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-19 19:30             ` Michal Hocko
@ 2026-02-20 14:30               ` Marcelo Tosatti
  0 siblings, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-20 14:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko wrote:
> On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > Michal,
> > 
> > Again, i don't see how moving operations to happen at return to 
> > kernel would help (assuming you are talking about 
> > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> 
> Nope, I am not talking about IPIs, although those are an example of pcp
> state as well. I am sorry I do not have a link handy, I am pretty sure
> Frederic will have that. Another example, though, was vmstat flushes
> that need to be pcp. There are many other examples. 
> 
> [...]
> 
> > You can't delay either kmalloc (removal of object from per-CPU freelist), 
> > or kfree (return of object from per-CPU freelist), or kmem_cache_shrink 
> > or kmem_cache_shrink to return to userspace.
> 
> Why?

Because kernel code might need to use that object right away, so it
needs to be allocated right after kmalloc returns.

> > What i missing something here? (or do you have something on your mind
> > which i can't see).
> 
> I am really sorry for being really vague here. Let me try to draw
> a more abstract problem definition and let's see whether we are trying
> to solve the same problem here. Maybe not...
> 
> I believe the main usecase of the interest here is uninterrupted
> userspace execution

The main usecase of interest is uninterrupted userspace execution, yes.

It is a good thing if you can enter the kernel, say perform system 
calls, and not be interrupted as well.

> and delayed pcp work that migh disturb such workload
> after it has returned to the userspace. Right?
> That is usually hauskeeping work that for, performance reasons, doesn't
> happen in hot paths while the workload was executing in the kernel
> space.
> 
> There are more ways to deal with that. You can either change the hot
> path to not require deferred operation (tricky withtout introducing
> regressions for most workloads) or you can define a more suitable place
> to perform the housekeeping while still running in the kernel. 
> 
> Your QWP work relies on local_lock -> spin_lock transition and
> performing the pcp work remotely so you do not need to disturb that
> remote cpu. Correct?
> 
> Alternative approach is to define a moment when the housekeeping
> operation is performed on that local cpu while still running in the
> kernel space - e.g. when returning to the userspace. Delayed work is
> then not necessary and userspace is not disrupted after returning to the
> userspace.
> 
> Do I make more sense or does the above sound like a complete gibberish?

OK, sure, but can't see how you can do that with per-CPU caches for
kmalloc, for example.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-19 15:27           ` Marcelo Tosatti
  2026-02-19 19:30             ` Michal Hocko
@ 2026-02-20 10:48             ` Vlastimil Babka
  2026-02-20 12:31               ` Michal Hocko
  2026-02-20 17:35               ` Marcelo Tosatti
  1 sibling, 2 replies; 35+ messages in thread
From: Vlastimil Babka @ 2026-02-20 10:48 UTC (permalink / raw)
  To: Marcelo Tosatti, Michal Hocko
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On 2/19/26 16:27, Marcelo Tosatti wrote:
> On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> 
> Michal,
> 
> Again, i don't see how moving operations to happen at return to 
> kernel would help (assuming you are talking about 
> "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> 
> The IPIs in the patchset above can be deferred until user->kernel
> transition because they are TLB flushes, for addresses which do not
> exist on the address space mapping in userspace.
> 
> What are the per-CPU objects in SLUB ?
> 
> struct slab_sheaf {
>         union {
>                 struct rcu_head rcu_head;
>                 struct list_head barn_list;
>                 /* only used for prefilled sheafs */
>                 struct {
>                         unsigned int capacity;
>                         bool pfmemalloc;
>                 };
>         };
>         struct kmem_cache *cache;
>         unsigned int size;
>         int node; /* only used for rcu_sheaf */
>         void *objects[];
> };
> 
> struct slub_percpu_sheaves {
>         local_trylock_t lock;
>         struct slab_sheaf *main; /* never NULL when unlocked */
>         struct slab_sheaf *spare; /* empty or full, may be NULL */
>         struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> };
> 
> Examples of local CPU operation that manipulates the data structures:
> 1) kmalloc, allocates an object from local per CPU list.
> 2) kfree, returns an object to local per CPU list.
> 
> Examples of an operation that would perform changes on the per-CPU lists 
> remotely:
> kmem_cache_shrink (cache shutdown), kmem_cache_shrink.
> 
> You can't delay either kmalloc (removal of object from per-CPU freelist), 
> or kfree (return of object from per-CPU freelist), or kmem_cache_shrink 
> or kmem_cache_shrink to return to userspace.
> 
> What i missing something here? (or do you have something on your mind
> which i can't see).

Let's try and analyze when we need to do the flushing in SLUB

- memory offline - would anyone do that with isolcpus? if yes, they probably
deserve the disruption

- cache shrinking (mainly from sysfs handler) - not necessary for
correctness, can probably skip cpu if needed, also kinda shooting your own
foot on isolcpu systems

- kmem_cache is being destroyed (__kmem_cache_shutdown()) - this is
important for correctness. destroying caches should be rare, but can't rule
it out

- kvfree_rcu_barrier() - a very tricky one; currently has only a debugging
caller, but that can change

(BTW, see the note in flush_rcu_sheaves_on_cache() and how it relies on the
flush actually happening on the cpu. Won't QPW violate that?)

How would this work with houskeeping on return to userspace approach?

- Would we just walk the list of all caches to flush them? could be
expensive. Would we somehow note only those that need it? That would make
the fast paths do something extra?

- If some other CPU executed kmem_cache_destroy(), it would have to wait for
the isolated cpu returning to userspace. Do we have the means for
synchronizing on that? Would that risk a deadlock? We used to have a
deferred finishing of the destroy for other reasons but were glad to get rid
of it when it was possible, now it might be necessary to revive it?

How would this work with QPW?

- probably fast paths more expensive due to spin lock vs local_trylock_t

- flush_rcu_sheaves_on_cache() needs to be solved safely (see above)

What if we avoid percpu sheaves completely on isolated cpus and instead
allocate/free using the slowpaths?

- It could probably be achieved without affecting fastpaths, as we already
handle bootstrap without sheaves, so it's implemented in a way to not affect
fastpaths.

- Would it slow the isolcpu workloads down too much when they do a syscall?
  - compared to "houskeeping on return to userspace" flushing, maybe not?
Because in that case the syscall starts with sheaves flushed from previous
return, it has to do something expensive to get the initial sheaf, then
maybe will use only on or few objects, then on return has to flush
everything. Likely the slowpath might be faster, unless it allocates/frees
many objects from the same cache.
  - compared to QPW - it would be slower as QPW would mostly retain sheaves
populated, the need for flushes should be very rare

So if we can assume that workloads on isolated cpus make syscalls only
rarely, and when they do they can tolerate them being slower, I think the
"avoid sheaves on isolated cpus" would be the best way here.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-20 10:48             ` Vlastimil Babka
@ 2026-02-20 12:31               ` Michal Hocko
  2026-02-20 17:35               ` Marcelo Tosatti
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2026-02-20 12:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Marcelo Tosatti, Leonardo Bras, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Fri 20-02-26 11:48:00, Vlastimil Babka wrote:
> On 2/19/26 16:27, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> > 
> > Michal,
> > 
> > Again, i don't see how moving operations to happen at return to 
> > kernel would help (assuming you are talking about 
> > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> > 
> > The IPIs in the patchset above can be deferred until user->kernel
> > transition because they are TLB flushes, for addresses which do not
> > exist on the address space mapping in userspace.
> > 
> > What are the per-CPU objects in SLUB ?
> > 
> > struct slab_sheaf {
> >         union {
> >                 struct rcu_head rcu_head;
> >                 struct list_head barn_list;
> >                 /* only used for prefilled sheafs */
> >                 struct {
> >                         unsigned int capacity;
> >                         bool pfmemalloc;
> >                 };
> >         };
> >         struct kmem_cache *cache;
> >         unsigned int size;
> >         int node; /* only used for rcu_sheaf */
> >         void *objects[];
> > };
> > 
> > struct slub_percpu_sheaves {
> >         local_trylock_t lock;
> >         struct slab_sheaf *main; /* never NULL when unlocked */
> >         struct slab_sheaf *spare; /* empty or full, may be NULL */
> >         struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> > };
> > 
> > Examples of local CPU operation that manipulates the data structures:
> > 1) kmalloc, allocates an object from local per CPU list.
> > 2) kfree, returns an object to local per CPU list.
> > 
> > Examples of an operation that would perform changes on the per-CPU lists 
> > remotely:
> > kmem_cache_shrink (cache shutdown), kmem_cache_shrink.
> > 
> > You can't delay either kmalloc (removal of object from per-CPU freelist), 
> > or kfree (return of object from per-CPU freelist), or kmem_cache_shrink 
> > or kmem_cache_shrink to return to userspace.
> > 
> > What i missing something here? (or do you have something on your mind
> > which i can't see).
> 
> Let's try and analyze when we need to do the flushing in SLUB
> 
> - memory offline - would anyone do that with isolcpus? if yes, they probably
> deserve the disruption
> 
> - cache shrinking (mainly from sysfs handler) - not necessary for
> correctness, can probably skip cpu if needed, also kinda shooting your own
> foot on isolcpu systems
> 
> - kmem_cache is being destroyed (__kmem_cache_shutdown()) - this is
> important for correctness. destroying caches should be rare, but can't rule
> it out
> 
> - kvfree_rcu_barrier() - a very tricky one; currently has only a debugging
> caller, but that can change
> 
> (BTW, see the note in flush_rcu_sheaves_on_cache() and how it relies on the
> flush actually happening on the cpu. Won't QPW violate that?)

Thanks, this is a very useful insight.
 
> How would this work with houskeeping on return to userspace approach?
> 
> - Would we just walk the list of all caches to flush them? could be
> expensive. Would we somehow note only those that need it? That would make
> the fast paths do something extra?
> 
> - If some other CPU executed kmem_cache_destroy(), it would have to wait for
> the isolated cpu returning to userspace. Do we have the means for
> synchronizing on that? Would that risk a deadlock? We used to have a
> deferred finishing of the destroy for other reasons but were glad to get rid
> of it when it was possible, now it might be necessary to revive it?

This would be tricky because there is no time guarantee when isolated
workload enters the kernel again. Maybe never if all the
pre-initialization was sufficient. On the other hand if the flush
happens on the way to userspace then you only need to wait for the
isolated workload to return from a syscall (modulo task dying and
similar edge cases).
 
> How would this work with QPW?
> 
> - probably fast paths more expensive due to spin lock vs local_trylock_t
> 
> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
> 
> What if we avoid percpu sheaves completely on isolated cpus and instead
> allocate/free using the slowpaths?

That seems like a reasonable performance price to pay for very edge case
(isolated workload).
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-20 10:48             ` Vlastimil Babka
  2026-02-20 12:31               ` Michal Hocko
@ 2026-02-20 17:35               ` Marcelo Tosatti
  2026-02-20 17:58                 ` Vlastimil Babka
  1 sibling, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-20 17:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Leonardo Bras, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

Hi Vlastimil,

On Fri, Feb 20, 2026 at 11:48:00AM +0100, Vlastimil Babka wrote:
> On 2/19/26 16:27, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> > 
> > Michal,
> > 
> > Again, i don't see how moving operations to happen at return to 
> > kernel would help (assuming you are talking about 
> > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> > 
> > The IPIs in the patchset above can be deferred until user->kernel
> > transition because they are TLB flushes, for addresses which do not
> > exist on the address space mapping in userspace.
> > 
> > What are the per-CPU objects in SLUB ?
> > 
> > struct slab_sheaf {
> >         union {
> >                 struct rcu_head rcu_head;
> >                 struct list_head barn_list;
> >                 /* only used for prefilled sheafs */
> >                 struct {
> >                         unsigned int capacity;
> >                         bool pfmemalloc;
> >                 };
> >         };
> >         struct kmem_cache *cache;
> >         unsigned int size;
> >         int node; /* only used for rcu_sheaf */
> >         void *objects[];
> > };
> > 
> > struct slub_percpu_sheaves {
> >         local_trylock_t lock;
> >         struct slab_sheaf *main; /* never NULL when unlocked */
> >         struct slab_sheaf *spare; /* empty or full, may be NULL */
> >         struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> > };
> > 
> > Examples of local CPU operation that manipulates the data structures:
> > 1) kmalloc, allocates an object from local per CPU list.
> > 2) kfree, returns an object to local per CPU list.
> > 
> > Examples of an operation that would perform changes on the per-CPU lists 
> > remotely:
> > kmem_cache_shrink (cache shutdown), kmem_cache_shrink.
> > 
> > You can't delay either kmalloc (removal of object from per-CPU freelist), 
> > or kfree (return of object from per-CPU freelist), or kmem_cache_shrink 
> > or kmem_cache_shrink to return to userspace.
> > 
> > What i missing something here? (or do you have something on your mind
> > which i can't see).
> 
> Let's try and analyze when we need to do the flushing in SLUB
> 
> - memory offline - would anyone do that with isolcpus? if yes, they probably
> deserve the disruption

I think its OK to avoid memory offline on such systems.

> - cache shrinking (mainly from sysfs handler) - not necessary for
> correctness, can probably skip cpu if needed, also kinda shooting your own
> foot on isolcpu systems
> 
> - kmem_cache is being destroyed (__kmem_cache_shutdown()) - this is
> important for correctness. destroying caches should be rare, but can't rule
> it out
> 
> - kvfree_rcu_barrier() - a very tricky one; currently has only a debugging
> caller, but that can change
> 
> (BTW, see the note in flush_rcu_sheaves_on_cache() and how it relies on the
> flush actually happening on the cpu. Won't QPW violate that?)

(struct kmem_cache *s)->cpu_sheaves (percpu)->rcu_free with the
s->cpu_sheaves->lock lock held:

do_free:

        rcu_sheaf = pcs->rcu_free;

        /*
         * Since we flush immediately when size reaches capacity, we never reach
         * this with size already at capacity, so no OOB write is possible.
         */
        rcu_sheaf->objects[rcu_sheaf->size++] = obj;

        if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
                rcu_sheaf = NULL;
        } else {
                pcs->rcu_free = NULL;
                rcu_sheaf->node = numa_mem_id();
        }

        /*
         * we flush before local_unlock to make sure a racing
         * flush_all_rcu_sheaves() doesn't miss this sheaf
         */
        if (rcu_sheaf)
                call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);

        qpw_unlock(&s->cpu_sheaves->lock, cpu);

So if it invokes call_rcu, it sets pcs->rcu_free = NULL. In that case,
for flush_rcu_sheaf executing remotely from flush_rcu_sheaves_on_cache
will:

static void flush_rcu_sheaf(struct work_struct *w)
{
        struct slub_percpu_sheaves *pcs;
        struct slab_sheaf *rcu_free;
        struct slub_flush_work *sfw;
        struct kmem_cache *s;
        int cpu = qpw_get_cpu(w);

        sfw = &per_cpu(slub_flush, cpu);
        s = sfw->s;

        qpw_lock(&s->cpu_sheaves->lock, cpu);
        pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

        rcu_free = pcs->rcu_free;
        pcs->rcu_free = NULL;

        qpw_unlock(&s->cpu_sheaves->lock, cpu);

        if (rcu_free)
                call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
}

Only call rcu_free_sheaf_nobarn if pcs->rcu_free is not NULL.

So it seems safe?

> How would this work with houskeeping on return to userspace approach?
> 
> - Would we just walk the list of all caches to flush them? could be
> expensive. Would we somehow note only those that need it? That would make
> the fast paths do something extra?
> 
> - If some other CPU executed kmem_cache_destroy(), it would have to wait for
> the isolated cpu returning to userspace. Do we have the means for
> synchronizing on that? Would that risk a deadlock? We used to have a
> deferred finishing of the destroy for other reasons but were glad to get rid
> of it when it was possible, now it might be necessary to revive it?

I don't think you can expect system calls to return to userspace in 
a given amount of time. Could be in kernel mode for long periods of
time.

> How would this work with QPW?
> 
> - probably fast paths more expensive due to spin lock vs local_trylock_t
> 
> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
> 
> What if we avoid percpu sheaves completely on isolated cpus and instead
> allocate/free using the slowpaths?
> 
> - It could probably be achieved without affecting fastpaths, as we already
> handle bootstrap without sheaves, so it's implemented in a way to not affect
> fastpaths.
> 
> - Would it slow the isolcpu workloads down too much when they do a syscall?
>   - compared to "houskeeping on return to userspace" flushing, maybe not?
> Because in that case the syscall starts with sheaves flushed from previous
> return, it has to do something expensive to get the initial sheaf, then
> maybe will use only on or few objects, then on return has to flush
> everything. Likely the slowpath might be faster, unless it allocates/frees
> many objects from the same cache.
>   - compared to QPW - it would be slower as QPW would mostly retain sheaves
> populated, the need for flushes should be very rare
> 
> So if we can assume that workloads on isolated cpus make syscalls only
> rarely, and when they do they can tolerate them being slower, I think the
> "avoid sheaves on isolated cpus" would be the best way here.

I am not sure its safe to assume that. Ask Gemini about isolcpus use
cases and:

1. High-Frequency Trading (HFT)
In the world of HFT, microseconds are the difference between profit and loss. 
Traders use isolcpus to pin their execution engines to specific cores.

The Goal: Eliminate "jitter" caused by the OS moving other processes onto the same core.

The Benefit: Guaranteed execution time and ultra-low latency.

2. Real-Time Audio & Video Processing
If you are running a Digital Audio Workstation (DAW) or a live video encoding rig, a tiny "hiccup" in CPU availability results in an audible pop or a dropped frame.

The Goal: Reserve cores specifically for the Digital Signal Processor (DSP) or the encoder.

The Benefit: Smooth, glitch-free media streams even when the rest of the system is busy.

3. Network Function Virtualization (NFV) & DPDK
For high-speed networking (like 10Gbps+ traffic), the Data Plane Development Kit (DPDK) uses "poll mode" drivers. These drivers constantly loop to check for new packets rather than waiting for interrupts.

The Goal: Isolate cores so they can run at 100% utilization just checking for network packets.

The Benefit: Maximum throughput and zero packet loss in high-traffic environments.

4. Gaming & Simulation
Competitive gamers or flight simulator enthusiasts sometimes isolate a few cores to handle the game's main thread, while leaving the rest of the OS (Discord, Chrome, etc.) to the remaining cores.

The Goal: Prevent background Windows/Linux tasks from stealing cycles from the game engine.

The Benefit: More consistent 1% low FPS and reduced input lag.

5. Deterministic Scientific Computing
If you're running a simulation that needs to take exactly the same amount of time every time it runs (for benchmarking or safety-critical testing), you can't have the OS interference messing with your metrics.

The Goal: Remove the variability of the Linux scheduler.

The Benefit: Highly repeatable, deterministic results.

===

For example, AF_XDP bypass uses system calls (and wants isolcpus):

https://www.quantvps.com/blog/kernel-bypass-in-hft?srsltid=AfmBOoryeSxuuZjzTJIC9O-Ag8x4gSwjs-V4Xukm2wQpGmwDJ6t4szuE




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-20 17:35               ` Marcelo Tosatti
@ 2026-02-20 17:58                 ` Vlastimil Babka
  2026-02-20 19:01                   ` Marcelo Tosatti
  0 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2026-02-20 17:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Michal Hocko, Leonardo Bras, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On 2/20/26 18:35, Marcelo Tosatti wrote:
> 
> Only call rcu_free_sheaf_nobarn if pcs->rcu_free is not NULL.
> 
> So it seems safe?

I guess it is.

>> How would this work with houskeeping on return to userspace approach?
>> 
>> - Would we just walk the list of all caches to flush them? could be
>> expensive. Would we somehow note only those that need it? That would make
>> the fast paths do something extra?
>> 
>> - If some other CPU executed kmem_cache_destroy(), it would have to wait for
>> the isolated cpu returning to userspace. Do we have the means for
>> synchronizing on that? Would that risk a deadlock? We used to have a
>> deferred finishing of the destroy for other reasons but were glad to get rid
>> of it when it was possible, now it might be necessary to revive it?
> 
> I don't think you can expect system calls to return to userspace in 
> a given amount of time. Could be in kernel mode for long periods of
> time.
> 
>> How would this work with QPW?
>> 
>> - probably fast paths more expensive due to spin lock vs local_trylock_t
>> 
>> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
>> 
>> What if we avoid percpu sheaves completely on isolated cpus and instead
>> allocate/free using the slowpaths?
>> 
>> - It could probably be achieved without affecting fastpaths, as we already
>> handle bootstrap without sheaves, so it's implemented in a way to not affect
>> fastpaths.
>> 
>> - Would it slow the isolcpu workloads down too much when they do a syscall?
>>   - compared to "houskeeping on return to userspace" flushing, maybe not?
>> Because in that case the syscall starts with sheaves flushed from previous
>> return, it has to do something expensive to get the initial sheaf, then
>> maybe will use only on or few objects, then on return has to flush
>> everything. Likely the slowpath might be faster, unless it allocates/frees
>> many objects from the same cache.
>>   - compared to QPW - it would be slower as QPW would mostly retain sheaves
>> populated, the need for flushes should be very rare
>> 
>> So if we can assume that workloads on isolated cpus make syscalls only
>> rarely, and when they do they can tolerate them being slower, I think the
>> "avoid sheaves on isolated cpus" would be the best way here.
> 
> I am not sure its safe to assume that. Ask Gemini about isolcpus use
> cases and:

I don't think it's answering the question about syscalls. But didn't read
too closely given the nature of it.

> 
> For example, AF_XDP bypass uses system calls (and wants isolcpus):
> 
> https://www.quantvps.com/blog/kernel-bypass-in-hft?srsltid=AfmBOoryeSxuuZjzTJIC9O-Ag8x4gSwjs-V4Xukm2wQpGmwDJ6t4szuE

Didn't spot system calls mentioned TBH.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-20 17:58                 ` Vlastimil Babka
@ 2026-02-20 19:01                   ` Marcelo Tosatti
  0 siblings, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-20 19:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Leonardo Bras, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Fri, Feb 20, 2026 at 06:58:10PM +0100, Vlastimil Babka wrote:
> On 2/20/26 18:35, Marcelo Tosatti wrote:
> > 
> > Only call rcu_free_sheaf_nobarn if pcs->rcu_free is not NULL.
> > 
> > So it seems safe?
> 
> I guess it is.
> 
> >> How would this work with houskeeping on return to userspace approach?
> >> 
> >> - Would we just walk the list of all caches to flush them? could be
> >> expensive. Would we somehow note only those that need it? That would make
> >> the fast paths do something extra?
> >> 
> >> - If some other CPU executed kmem_cache_destroy(), it would have to wait for
> >> the isolated cpu returning to userspace. Do we have the means for
> >> synchronizing on that? Would that risk a deadlock? We used to have a
> >> deferred finishing of the destroy for other reasons but were glad to get rid
> >> of it when it was possible, now it might be necessary to revive it?
> > 
> > I don't think you can expect system calls to return to userspace in 
> > a given amount of time. Could be in kernel mode for long periods of
> > time.
> > 
> >> How would this work with QPW?
> >> 
> >> - probably fast paths more expensive due to spin lock vs local_trylock_t
> >> 
> >> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
> >> 
> >> What if we avoid percpu sheaves completely on isolated cpus and instead
> >> allocate/free using the slowpaths?
> >> 
> >> - It could probably be achieved without affecting fastpaths, as we already
> >> handle bootstrap without sheaves, so it's implemented in a way to not affect
> >> fastpaths.
> >> 
> >> - Would it slow the isolcpu workloads down too much when they do a syscall?
> >>   - compared to "houskeeping on return to userspace" flushing, maybe not?
> >> Because in that case the syscall starts with sheaves flushed from previous
> >> return, it has to do something expensive to get the initial sheaf, then
> >> maybe will use only on or few objects, then on return has to flush
> >> everything. Likely the slowpath might be faster, unless it allocates/frees
> >> many objects from the same cache.
> >>   - compared to QPW - it would be slower as QPW would mostly retain sheaves
> >> populated, the need for flushes should be very rare
> >> 
> >> So if we can assume that workloads on isolated cpus make syscalls only
> >> rarely, and when they do they can tolerate them being slower, I think the
> >> "avoid sheaves on isolated cpus" would be the best way here.
> > 
> > I am not sure its safe to assume that. Ask Gemini about isolcpus use
> > cases and:
> 
> I don't think it's answering the question about syscalls. But didn't read
> too closely given the nature of it.

People use isolcpus with all kinds of programs. 

> > For example, AF_XDP bypass uses system calls (and wants isolcpus):
> > 
> > https://www.quantvps.com/blog/kernel-bypass-in-hft?srsltid=AfmBOoryeSxuuZjzTJIC9O-Ag8x4gSwjs-V4Xukm2wQpGmwDJ6t4szuE
> 
> Didn't spot system calls mentioned TBH.

I don't see why you want to reduce performance of applications that 
execute on isolcpus=, if you can avoid that.

Also, won't bypassing the per-CPU caches increase contention on the 
global locks, say kmem_cache_node->list_lock.

But if you prefer disabling the per-CPU caches for isolcpus
(or a separate option other than isolcpus), then see if 
people complain about that... works for me.

Two examples:

1)

https://github.com/xdp-project/bpf-examples/blob/main/AF_XDP-example/README.org

Busy-Poll mode
Busy-poll mode. In this mode both the application and the driver can be run efficiently on the same core. The kernel driver is explicitly invoked by the application by calling either recvmsg() or sendto(). Invoke this by setting the -B option. The -b option can be used to set the batch size that the driver will use. For example:

sudo taskset -c 2 ./xdpsock -i <interface> -q 2 -l -N -B -b 256

2)

https://vstinner.github.io/journey-to-stable-benchmark-system.html

Example of effect of CPU isolation on a microbenchmark
Example with Linux parameters:

isolcpus=2,3,6,7 nohz_full=2,3,6,7
Microbenchmark on an idle system (without CPU isolation):

$ python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 229 msec per loop
Result on a busy system using system_load.py 10 and find / commands running in other terminals:

$ python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 372 msec per loop
The microbenchmark is 56% slower because of the high system load!

Result on the same busy system but using isolated CPUs. The taskset command allows to pin an application to specific CPUs:

$ taskset -c 1,3 python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 230 msec per loop
Just to check, new run without CPU isolation:

$ python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 357 msec per loop
The result with CPU isolation on a busy system is the same than the result an idle system! CPU isolation removes most of the noise of the system.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-16 11:00         ` Michal Hocko
  2026-02-19 15:27           ` Marcelo Tosatti
@ 2026-02-20 16:51           ` Marcelo Tosatti
  2026-02-20 16:55             ` Marcelo Tosatti
  2026-02-20 21:58           ` Leonardo Bras
  2 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-20 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> > On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > > have requirements as strong as RT workloads but the underlying
> > > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > > moving those pcp book keeping activities to be executed to the return to
> > > > > the userspace which should be taking care of both RT and non-RT
> > > > > configurations AFAICS.
> > > > 
> > > > Michal,
> > > > 
> > > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > > boot option qpw=y/n, which controls whether the behaviour will be
> > > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > > 
> > > My bad. I've misread the config space of this.
> > > 
> > > > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > > > (and remote work via work_queue) is used.
> > > > 
> > > > What "pcp book keeping activities" you refer to ? I don't see how
> > > > moving certain activities that happen under SLUB or LRU spinlocks
> > > > to happen before return to userspace changes things related 
> > > > to avoidance of CPU interruption ?
> > > 
> > > Essentially delayed operations like pcp state flushing happens on return
> > > to the userspace on isolated CPUs. No locking changes are required as
> > > the work is still per-cpu.
> > > 
> > > In other words the approach Frederic is working on is to not change the
> > > locking of pcp delayed work but instead move that work into well defined
> > > place - i.e. return to the userspace.
> > > 
> > > Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> > > paths like SLUB sheeves?
> > 
> > Hi Michal,
> > 
> > I have done some study on this (which I presented on Plumbers 2023):
> > https://lpc.events/event/17/contributions/1484/ 
> > 
> > Since they are per-cpu spinlocks, and the remote operations are not that 
> > frequent, as per design of the current approach, we are not supposed to see 
> > contention (I was not able to detect contention even after stress testing 
> > for weeks), nor relevant cacheline bouncing.
> > 
> > That being said, for RT local_locks already get per-cpu spinlocks, so there 
> > is only difference for !RT, which as you mention, does preemtp_disable():
> > 
> > The performance impact noticed was mostly about jumping around in 
> > executable code, as inlining spinlocks (test #2 on presentation) took care 
> > of most of the added extra cycles, adding about 4-14 extra cycles per 
> > lock/unlock cycle. (tested on memcg with kmalloc test)
> > 
> > Yeah, as expected there is some extra cycles, as we are doing extra atomic 
> > operations (even if in a local cacheline) in !RT case, but this could be 
> > enabled only if the user thinks this is an ok cost for reducing 
> > interruptions.
> > 
> > What do you think?
> 
> The fact that the behavior is opt-in for !RT is certainly a plus. I also
> do not expect the overhead to be really be really big. To me, a much
> more important question is which of the two approaches is easier to
> maintain long term. The pcp work needs to be done one way or the other.
> Whether we want to tweak locking or do it at a very well defined time is
> the bigger question.

Without patchset:
================

[ 1188.050725] kmalloc_bench: Avg cycles per kmalloc: 159

With qpw patchset, CONFIG_QPW=n:
================================

[   50.292190] kmalloc_bench: Avg cycles per kmalloc: 163

With qpw patchset, CONFIG_QPW=y, qpw=0:
=======================================

[   29.872153] kmalloc_bench: Avg cycles per kmalloc: 170


With qpw patchset, CONFIG_QPW=y, qpw=1:
========================================

[   37.494687] kmalloc_bench: Avg cycles per kmalloc: 190

With PREEMPT_RT enabled, qpw=0:
===============================

[   65.163251] kmalloc_bench: Avg cycles per kmalloc: 181

With PREEMPT_RT enabled, no patchset:
=====================================
[   52.701639] kmalloc_bench: Avg cycles per kmalloc: 185

With PREEMPT_RT enabled, qpw=1:
==============================

[   35.103830] kmalloc_bench: Avg cycles per kmalloc: 196




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-20 16:51           ` Marcelo Tosatti
@ 2026-02-20 16:55             ` Marcelo Tosatti
  2026-02-20 22:38               ` Leonardo Bras
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-20 16:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Leonardo Bras, linux-kernel, cgroups, linux-mm, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feng, Frederic Weisbecker

On Fri, Feb 20, 2026 at 01:51:13PM -0300, Marcelo Tosatti wrote:
> On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> > On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> > > On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > > > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > > > have requirements as strong as RT workloads but the underlying
> > > > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > > > moving those pcp book keeping activities to be executed to the return to
> > > > > > the userspace which should be taking care of both RT and non-RT
> > > > > > configurations AFAICS.
> > > > > 
> > > > > Michal,
> > > > > 
> > > > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > > > boot option qpw=y/n, which controls whether the behaviour will be
> > > > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > > > 
> > > > My bad. I've misread the config space of this.
> > > > 
> > > > > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > > > > (and remote work via work_queue) is used.
> > > > > 
> > > > > What "pcp book keeping activities" you refer to ? I don't see how
> > > > > moving certain activities that happen under SLUB or LRU spinlocks
> > > > > to happen before return to userspace changes things related 
> > > > > to avoidance of CPU interruption ?
> > > > 
> > > > Essentially delayed operations like pcp state flushing happens on return
> > > > to the userspace on isolated CPUs. No locking changes are required as
> > > > the work is still per-cpu.
> > > > 
> > > > In other words the approach Frederic is working on is to not change the
> > > > locking of pcp delayed work but instead move that work into well defined
> > > > place - i.e. return to the userspace.
> > > > 
> > > > Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> > > > paths like SLUB sheeves?
> > > 
> > > Hi Michal,
> > > 
> > > I have done some study on this (which I presented on Plumbers 2023):
> > > https://lpc.events/event/17/contributions/1484/ 
> > > 
> > > Since they are per-cpu spinlocks, and the remote operations are not that 
> > > frequent, as per design of the current approach, we are not supposed to see 
> > > contention (I was not able to detect contention even after stress testing 
> > > for weeks), nor relevant cacheline bouncing.
> > > 
> > > That being said, for RT local_locks already get per-cpu spinlocks, so there 
> > > is only difference for !RT, which as you mention, does preemtp_disable():
> > > 
> > > The performance impact noticed was mostly about jumping around in 
> > > executable code, as inlining spinlocks (test #2 on presentation) took care 
> > > of most of the added extra cycles, adding about 4-14 extra cycles per 
> > > lock/unlock cycle. (tested on memcg with kmalloc test)
> > > 
> > > Yeah, as expected there is some extra cycles, as we are doing extra atomic 
> > > operations (even if in a local cacheline) in !RT case, but this could be 
> > > enabled only if the user thinks this is an ok cost for reducing 
> > > interruptions.
> > > 
> > > What do you think?
> > 
> > The fact that the behavior is opt-in for !RT is certainly a plus. I also
> > do not expect the overhead to be really be really big. To me, a much
> > more important question is which of the two approaches is easier to
> > maintain long term. The pcp work needs to be done one way or the other.
> > Whether we want to tweak locking or do it at a very well defined time is
> > the bigger question.
> 
> Without patchset:
> ================
> 
> [ 1188.050725] kmalloc_bench: Avg cycles per kmalloc: 159
> 
> With qpw patchset, CONFIG_QPW=n:
> ================================
> 
> [   50.292190] kmalloc_bench: Avg cycles per kmalloc: 163
> 
> With qpw patchset, CONFIG_QPW=y, qpw=0:
> =======================================
> 
> [   29.872153] kmalloc_bench: Avg cycles per kmalloc: 170
> 
> 
> With qpw patchset, CONFIG_QPW=y, qpw=1:
> ========================================
> 
> [   37.494687] kmalloc_bench: Avg cycles per kmalloc: 190
> 
> With PREEMPT_RT enabled, qpw=0:
> ===============================
> 
> [   65.163251] kmalloc_bench: Avg cycles per kmalloc: 181
> 
> With PREEMPT_RT enabled, no patchset:
> =====================================
> [   52.701639] kmalloc_bench: Avg cycles per kmalloc: 185
> 
> With PREEMPT_RT enabled, qpw=1:
> ==============================
> 
> [   35.103830] kmalloc_bench: Avg cycles per kmalloc: 196

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/timex.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Gemini AI");
MODULE_DESCRIPTION("A simple kmalloc performance benchmark");

static int size = 64; // Default allocation size in bytes
module_param(size, int, 0644);

static int iterations = 1000000; // Default number of iterations
module_param(iterations, int, 0644);

static int __init kmalloc_bench_init(void) {
    void **ptrs;
    cycles_t start, end;
    uint64_t total_cycles;
    int i;
    pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);

    // Allocate an array to store pointers to avoid immediate kfree-reuse optimization
    ptrs = vmalloc(sizeof(void *) * iterations);
    if (!ptrs) {
        pr_err("kmalloc_bench: Failed to allocate pointer array\n");
        return -ENOMEM;
    }

    preempt_disable();
    start = get_cycles();

    for (i = 0; i < iterations; i++) {
        ptrs[i] = kmalloc(size, GFP_ATOMIC);
    }

    end = get_cycles();

    total_cycles = end - start;
    preempt_enable();

    pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
    pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);

    // Cleanup
    for (i = 0; i < iterations; i++) {
        kfree(ptrs[i]);
    }
    vfree(ptrs);

    return 0;
}

static void __exit kmalloc_bench_exit(void) {
    pr_info("kmalloc_bench: Module unloaded\n");
}




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-20 16:55             ` Marcelo Tosatti
@ 2026-02-20 22:38               ` Leonardo Bras
  0 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-20 22:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, Michal Hocko, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Fri, Feb 20, 2026 at 01:55:57PM -0300, Marcelo Tosatti wrote:
> On Fri, Feb 20, 2026 at 01:51:13PM -0300, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> > > On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> > > > On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > > > > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > > > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > > > > have requirements as strong as RT workloads but the underlying
> > > > > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > > > > moving those pcp book keeping activities to be executed to the return to
> > > > > > > the userspace which should be taking care of both RT and non-RT
> > > > > > > configurations AFAICS.
> > > > > > 
> > > > > > Michal,
> > > > > > 
> > > > > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > > > > boot option qpw=y/n, which controls whether the behaviour will be
> > > > > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > > > > 
> > > > > My bad. I've misread the config space of this.
> > > > > 
> > > > > > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > > > > > (and remote work via work_queue) is used.
> > > > > > 
> > > > > > What "pcp book keeping activities" you refer to ? I don't see how
> > > > > > moving certain activities that happen under SLUB or LRU spinlocks
> > > > > > to happen before return to userspace changes things related 
> > > > > > to avoidance of CPU interruption ?
> > > > > 
> > > > > Essentially delayed operations like pcp state flushing happens on return
> > > > > to the userspace on isolated CPUs. No locking changes are required as
> > > > > the work is still per-cpu.
> > > > > 
> > > > > In other words the approach Frederic is working on is to not change the
> > > > > locking of pcp delayed work but instead move that work into well defined
> > > > > place - i.e. return to the userspace.
> > > > > 
> > > > > Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> > > > > paths like SLUB sheeves?
> > > > 
> > > > Hi Michal,
> > > > 
> > > > I have done some study on this (which I presented on Plumbers 2023):
> > > > https://lpc.events/event/17/contributions/1484/ 
> > > > 
> > > > Since they are per-cpu spinlocks, and the remote operations are not that 
> > > > frequent, as per design of the current approach, we are not supposed to see 
> > > > contention (I was not able to detect contention even after stress testing 
> > > > for weeks), nor relevant cacheline bouncing.
> > > > 
> > > > That being said, for RT local_locks already get per-cpu spinlocks, so there 
> > > > is only difference for !RT, which as you mention, does preemtp_disable():
> > > > 
> > > > The performance impact noticed was mostly about jumping around in 
> > > > executable code, as inlining spinlocks (test #2 on presentation) took care 
> > > > of most of the added extra cycles, adding about 4-14 extra cycles per 
> > > > lock/unlock cycle. (tested on memcg with kmalloc test)
> > > > 
> > > > Yeah, as expected there is some extra cycles, as we are doing extra atomic 
> > > > operations (even if in a local cacheline) in !RT case, but this could be 
> > > > enabled only if the user thinks this is an ok cost for reducing 
> > > > interruptions.
> > > > 
> > > > What do you think?
> > > 
> > > The fact that the behavior is opt-in for !RT is certainly a plus. I also
> > > do not expect the overhead to be really be really big. To me, a much
> > > more important question is which of the two approaches is easier to
> > > maintain long term. The pcp work needs to be done one way or the other.
> > > Whether we want to tweak locking or do it at a very well defined time is
> > > the bigger question.
> > 
> > Without patchset:
> > ================
> > 
> > [ 1188.050725] kmalloc_bench: Avg cycles per kmalloc: 159
> > 
> > With qpw patchset, CONFIG_QPW=n:
> > ================================
> > 
> > [   50.292190] kmalloc_bench: Avg cycles per kmalloc: 163

Weird.. with CONFIG_QPW we should see no difference.
Oh, maybe the changes in the code, such as adding a new cpu parameter in 
some functions may have caused this.

(oh, there is the migrate_disable as well)

> > 
> > With qpw patchset, CONFIG_QPW=y, qpw=0:
> > =======================================
> > 
> > [   29.872153] kmalloc_bench: Avg cycles per kmalloc: 170
> > 

Humm, what changed here is basically from

+#define qpw_lock(lock, cpu)			\
+	local_lock(lock)

to 

+#define qpw_lock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
+		else									\
+			local_lock(lock.ll);						\
+	} while (0)


So only the cost of a static branch.. maybe I did something wrong here 
with the static_branch_maybe, as any cpu branch predictor should make this 
delta close to zero.

> > 
> > With qpw patchset, CONFIG_QPW=y, qpw=1:
> > ========================================
> > 
> > [   37.494687] kmalloc_bench: Avg cycles per kmalloc: 190
> > 

20 cycles as a price for a local_lock->spinlock seems too much.
Taking in account the previous message, maybe we should work on making them 
inlined spinlocks, if not already.
(Yeah, I missed that verification :| )

> > With PREEMPT_RT enabled, qpw=0:
> > ===============================
> > 
> > [   65.163251] kmalloc_bench: Avg cycles per kmalloc: 181
> > 
> > With PREEMPT_RT enabled, no patchset:
> > =====================================
> > [   52.701639] kmalloc_bench: Avg cycles per kmalloc: 185
> > 

Nice, having the QPW patch saved some cycles :)


> > With PREEMPT_RT enabled, qpw=1:
> > ==============================
> > 
> > [   35.103830] kmalloc_bench: Avg cycles per kmalloc: 196
> 

This is odd, though. The spinlock is already there, so from qpw=0 to qpw=1 
there should be no performance change. Maybe in local_lock they do some 
optimization in their spinlock?


> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/slab.h>
> #include <linux/timex.h>
> #include <linux/preempt.h>
> #include <linux/irqflags.h>
> #include <linux/vmalloc.h>
> 
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Gemini AI");
> MODULE_DESCRIPTION("A simple kmalloc performance benchmark");
> 
> static int size = 64; // Default allocation size in bytes
> module_param(size, int, 0644);
> 
> static int iterations = 1000000; // Default number of iterations
> module_param(iterations, int, 0644);
> 
> static int __init kmalloc_bench_init(void) {
>     void **ptrs;
>     cycles_t start, end;
>     uint64_t total_cycles;
>     int i;
>     pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);
> 
>     // Allocate an array to store pointers to avoid immediate kfree-reuse optimization
>     ptrs = vmalloc(sizeof(void *) * iterations);
>     if (!ptrs) {
>         pr_err("kmalloc_bench: Failed to allocate pointer array\n");
>         return -ENOMEM;
>     }
> 
>     preempt_disable();
>     start = get_cycles();
> 
>     for (i = 0; i < iterations; i++) {
>         ptrs[i] = kmalloc(size, GFP_ATOMIC);
>     }
> 
>     end = get_cycles();
> 
>     total_cycles = end - start;
>     preempt_enable();
> 
>     pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
>     pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);
> 
>     // Cleanup
>     for (i = 0; i < iterations; i++) {
>         kfree(ptrs[i]);
>     }
>     vfree(ptrs);
> 
>     return 0;
> }
> 
> static void __exit kmalloc_bench_exit(void) {
>     pr_info("kmalloc_bench: Module unloaded\n");
> }
> 
> 


Nice!
Please collect min and max as well, maybe we can have an insight of what 
could have happened, then :)

What was the system you used for testing?

Thanks!
Leo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-16 11:00         ` Michal Hocko
  2026-02-19 15:27           ` Marcelo Tosatti
  2026-02-20 16:51           ` Marcelo Tosatti
@ 2026-02-20 21:58           ` Leonardo Bras
  2 siblings, 0 replies; 35+ messages in thread
From: Leonardo Bras @ 2026-02-20 21:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker

On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> > On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > > have requirements as strong as RT workloads but the underlying
> > > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > > moving those pcp book keeping activities to be executed to the return to
> > > > > the userspace which should be taking care of both RT and non-RT
> > > > > configurations AFAICS.
> > > > 
> > > > Michal,
> > > > 
> > > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > > boot option qpw=y/n, which controls whether the behaviour will be
> > > > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> > > 
> > > My bad. I've misread the config space of this.
> > > 
> > > > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > > > (and remote work via work_queue) is used.
> > > > 
> > > > What "pcp book keeping activities" you refer to ? I don't see how
> > > > moving certain activities that happen under SLUB or LRU spinlocks
> > > > to happen before return to userspace changes things related 
> > > > to avoidance of CPU interruption ?
> > > 
> > > Essentially delayed operations like pcp state flushing happens on return
> > > to the userspace on isolated CPUs. No locking changes are required as
> > > the work is still per-cpu.
> > > 
> > > In other words the approach Frederic is working on is to not change the
> > > locking of pcp delayed work but instead move that work into well defined
> > > place - i.e. return to the userspace.
> > > 
> > > Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> > > paths like SLUB sheeves?
> > 
> > Hi Michal,
> > 
> > I have done some study on this (which I presented on Plumbers 2023):
> > https://lpc.events/event/17/contributions/1484/ 
> > 
> > Since they are per-cpu spinlocks, and the remote operations are not that 
> > frequent, as per design of the current approach, we are not supposed to see 
> > contention (I was not able to detect contention even after stress testing 
> > for weeks), nor relevant cacheline bouncing.
> > 
> > That being said, for RT local_locks already get per-cpu spinlocks, so there 
> > is only difference for !RT, which as you mention, does preemtp_disable():
> > 
> > The performance impact noticed was mostly about jumping around in 
> > executable code, as inlining spinlocks (test #2 on presentation) took care 
> > of most of the added extra cycles, adding about 4-14 extra cycles per 
> > lock/unlock cycle. (tested on memcg with kmalloc test)
> > 
> > Yeah, as expected there is some extra cycles, as we are doing extra atomic 
> > operations (even if in a local cacheline) in !RT case, but this could be 
> > enabled only if the user thinks this is an ok cost for reducing 
> > interruptions.
> > 
> > What do you think?
> 
> The fact that the behavior is opt-in for !RT is certainly a plus. I also
> do not expect the overhead to be really be really big. 

Awesome! Thanks for reviewing!

> To me, a much
> more important question is which of the two approaches is easier to
> maintain long term. The pcp work needs to be done one way or the other.
> Whether we want to tweak locking or do it at a very well defined time is
> the bigger question.

That crossed my mind as well, and I went with the idea of changing locking 
because I was working on workloads in which deferring work to a kernel 
re-entry would cause deadline misses as well. Or more critically, the 
drains could take forever, as some of those tasks would avoid returning to 
kernel as much as possible. 

Thanks!
Leo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/4] Introduce QPW for per-cpu operations
  2026-02-11 16:38     ` Michal Hocko
  2026-02-11 16:50       ` Marcelo Tosatti
  2026-02-14 22:02       ` Leonardo Bras
@ 2026-02-19 13:15       ` Marcelo Tosatti
  2 siblings, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2026-02-19 13:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> [...]
> > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > have requirements as strong as RT workloads but the underlying
> > > fundamental problem is the same. Frederic (now CCed) is working on
> > > moving those pcp book keeping activities to be executed to the return to
> > > the userspace which should be taking care of both RT and non-RT
> > > configurations AFAICS.
> > 
> > Michal,
> > 
> > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > boot option qpw=y/n, which controls whether the behaviour will be
> > similar (the spinlock is taken on local_lock, similar to PREEMPT_RT).
> 
> My bad. I've misread the config space of this.
> 
> > If CONFIG_QPW=n, or kernel boot option qpw=n, then only local_lock 
> > (and remote work via work_queue) is used.
> > 
> > What "pcp book keeping activities" you refer to ? I don't see how
> > moving certain activities that happen under SLUB or LRU spinlocks
> > to happen before return to userspace changes things related 
> > to avoidance of CPU interruption ?
> 
> Essentially delayed operations like pcp state flushing happens on return
> to the userspace on isolated CPUs. No locking changes are required as
> the work is still per-cpu.
> 
> In other words the approach Frederic is working on is to not change the
> locking of pcp delayed work but instead move that work into well defined
> place - i.e. return to the userspace.

Michal,

I can't find such work from Frederic. Do you mean:

Subject: Re: [PATCH v6 00/29] context_tracking,x86: Defer some IPIs until a user->kernel transition

From Valentin ?

Or if you have a pointer to Frederic's work.

> Btw. have you measure the impact of preempt_disbale -> spinlock on hot
> paths like SLUB sheeves?

Doing that, will post results as soon as possible.



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2026-02-20 22:38 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
2026-02-06 15:20   ` Marcelo Tosatti
2026-02-07  0:16   ` Leonardo Bras
2026-02-11 12:09     ` Marcelo Tosatti
2026-02-14 21:32       ` Leonardo Bras
2026-02-06 14:34 ` [PATCH 2/4] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
2026-02-06 14:34 ` [PATCH 3/4] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
2026-02-07  1:06   ` Leonardo Bras
2026-02-06 14:34 ` [PATCH 4/4] slub: " Marcelo Tosatti
2026-02-07  1:27   ` Leonardo Bras
2026-02-06 23:56 ` [PATCH 0/4] Introduce QPW for per-cpu operations Leonardo Bras
2026-02-10 14:01 ` Michal Hocko
2026-02-11 12:01   ` Marcelo Tosatti
2026-02-11 12:11     ` Marcelo Tosatti
2026-02-14 21:35       ` Leonardo Bras
2026-02-11 16:38     ` Michal Hocko
2026-02-11 16:50       ` Marcelo Tosatti
2026-02-11 16:59         ` Vlastimil Babka
2026-02-11 17:07         ` Michal Hocko
2026-02-14 22:02       ` Leonardo Bras
2026-02-16 11:00         ` Michal Hocko
2026-02-19 15:27           ` Marcelo Tosatti
2026-02-19 19:30             ` Michal Hocko
2026-02-20 14:30               ` Marcelo Tosatti
2026-02-20 10:48             ` Vlastimil Babka
2026-02-20 12:31               ` Michal Hocko
2026-02-20 17:35               ` Marcelo Tosatti
2026-02-20 17:58                 ` Vlastimil Babka
2026-02-20 19:01                   ` Marcelo Tosatti
2026-02-20 16:51           ` Marcelo Tosatti
2026-02-20 16:55             ` Marcelo Tosatti
2026-02-20 22:38               ` Leonardo Bras
2026-02-20 21:58           ` Leonardo Bras
2026-02-19 13:15       ` Marcelo Tosatti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox