From: Marcelo Tosatti <mtosatti@redhat.com>
To: Vlastimil Babka <vbabka@suse.com>
Cc: Michal Hocko <mhocko@suse.com>,
Leonardo Bras <leobras.c@gmail.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Vlastimil Babka <vbabka@suse.cz>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Leonardo Bras <leobras@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
Frederic Weisbecker <fweisbecker@suse.de>
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
Date: Fri, 20 Feb 2026 14:35:41 -0300
Message-ID: <aZibbYH7yrDZlnJh@tpad>
In-Reply-To: <3f2b985a-2fb0-4d63-9dce-8a9cad8ce464@suse.com>
Hi Vlastimil,
On Fri, Feb 20, 2026 at 11:48:00AM +0100, Vlastimil Babka wrote:
> On 2/19/26 16:27, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> >
> > Michal,
> >
> > Again, I don't see how moving operations to happen at the return to
> > the kernel would help (assuming you are talking about
> > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> >
> > The IPIs in the patchset above can be deferred until the user->kernel
> > transition because they are TLB flushes for addresses which are not
> > mapped in the userspace address space.
> >
> > What are the per-CPU objects in SLUB?
> >
> > struct slab_sheaf {
> >         union {
> >                 struct rcu_head rcu_head;
> >                 struct list_head barn_list;
> >                 /* only used for prefilled sheafs */
> >                 struct {
> >                         unsigned int capacity;
> >                         bool pfmemalloc;
> >                 };
> >         };
> >         struct kmem_cache *cache;
> >         unsigned int size;
> >         int node;       /* only used for rcu_sheaf */
> >         void *objects[];
> > };
> >
> > struct slub_percpu_sheaves {
> >         local_trylock_t lock;
> >         struct slab_sheaf *main;     /* never NULL when unlocked */
> >         struct slab_sheaf *spare;    /* empty or full, may be NULL */
> >         struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> > };
> >
> > Examples of local CPU operations that manipulate these data structures:
> > 1) kmalloc, which allocates an object from the local per-CPU list.
> > 2) kfree, which returns an object to the local per-CPU list.
> >
> > Examples of operations that perform changes on the per-CPU lists
> > remotely: __kmem_cache_shutdown (cache destruction) and
> > kmem_cache_shrink.
> >
> > You can't delay kmalloc (removal of an object from a per-CPU freelist),
> > kfree (return of an object to a per-CPU freelist), __kmem_cache_shutdown
> > or kmem_cache_shrink until the return to userspace.
> >
> > Am I missing something here? (Or do you have something in mind
> > which I can't see?)
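
To make the above concrete, here is a minimal sketch of how an
allocation fast path consumes the local CPU's main sheaf. This is
simplified for illustration (refill, slowpath fallback and NUMA
handling omitted, and the function name is made up); it is not the
exact sheaves code:

/* Illustrative sketch only, not the actual SLUB implementation. */
static void *sheaf_alloc_fastpath(struct kmem_cache *s)
{
        struct slub_percpu_sheaves *pcs;
        void *object = NULL;

        /*
         * The fast path never spins: if the local lock is unavailable,
         * return NULL and let the caller fall back to the slowpath.
         */
        if (!local_trylock(&s->cpu_sheaves->lock))
                return NULL;

        pcs = this_cpu_ptr(s->cpu_sheaves);

        /* Pop one object off the local main sheaf, if it has any. */
        if (pcs->main->size)
                object = pcs->main->objects[--pcs->main->size];

        local_unlock(&s->cpu_sheaves->lock);

        return object;
}

Both kmalloc and kfree only ever touch the local CPU's data under
that lock, so they cannot be deferred; only the remote flushers are
in question.
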
>
> Let's try and analyze when we need to do the flushing in SLUB
>
> - memory offline - would anyone do that with isolcpus? if yes, they probably
> deserve the disruption
I think it's OK to avoid memory offline on such systems.
> - cache shrinking (mainly from sysfs handler) - not necessary for
> correctness, can probably skip cpu if needed, also kinda shooting your own
> foot on isolcpu systems
>
> - kmem_cache is being destroyed (__kmem_cache_shutdown()) - this is
> important for correctness. destroying caches should be rare, but can't rule
> it out
>
> - kvfree_rcu_barrier() - a very tricky one; currently has only a debugging
> caller, but that can change
>
> (BTW, see the note in flush_rcu_sheaves_on_cache() and how it relies on the
> flush actually happening on the cpu. Won't QPW violate that?)
The freeing path manipulates (struct kmem_cache *s)->cpu_sheaves->rcu_free
(a per-CPU structure) with the s->cpu_sheaves->lock held:
do_free:
        rcu_sheaf = pcs->rcu_free;

        /*
         * Since we flush immediately when size reaches capacity, we never
         * reach this with size already at capacity, so no OOB write is
         * possible.
         */
        rcu_sheaf->objects[rcu_sheaf->size++] = obj;

        if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
                rcu_sheaf = NULL;
        } else {
                pcs->rcu_free = NULL;
                rcu_sheaf->node = numa_mem_id();
        }

        /*
         * we flush before local_unlock to make sure a racing
         * flush_all_rcu_sheaves() doesn't miss this sheaf
         */
        if (rcu_sheaf)
                call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);

        qpw_unlock(&s->cpu_sheaves->lock, cpu);
So if it invokes call_rcu, it has set pcs->rcu_free = NULL. In that
case, flush_rcu_sheaf, executing remotely on behalf of
flush_rcu_sheaves_on_cache, will do:
static void flush_rcu_sheaf(struct work_struct *w)
{
        struct slub_percpu_sheaves *pcs;
        struct slab_sheaf *rcu_free;
        struct slub_flush_work *sfw;
        struct kmem_cache *s;
        int cpu = qpw_get_cpu(w);

        sfw = &per_cpu(slub_flush, cpu);
        s = sfw->s;

        qpw_lock(&s->cpu_sheaves->lock, cpu);

        pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
        rcu_free = pcs->rcu_free;
        pcs->rcu_free = NULL;

        qpw_unlock(&s->cpu_sheaves->lock, cpu);

        if (rcu_free)
                call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
}
It only calls rcu_free_sheaf_nobarn if pcs->rcu_free is not NULL.
So it seems safe?
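
For reference, these are the qpw_lock()/qpw_unlock() semantics I am
assuming in the snippets above, paraphrased from patch 1 of this
series ("Introducing qpw_lock() and per-cpu queue & flush work").
An illustrative sketch from memory; the exact definitions may differ:

/*
 * On PREEMPT_RT, a local lock is a per-CPU spinlock_t, so a
 * housekeeping CPU can legitimately take the lock of a remote
 * (isolated) CPU and flush on its behalf. Without PREEMPT_RT the
 * work item is still queued on the target CPU itself, so plain
 * local_lock()/local_unlock() are sufficient.
 */
#ifdef CONFIG_PREEMPT_RT
#define qpw_lock(lock, cpu)     spin_lock(per_cpu_ptr(lock, cpu))
#define qpw_unlock(lock, cpu)   spin_unlock(per_cpu_ptr(lock, cpu))
#else
#define qpw_lock(lock, cpu)     local_lock(lock)
#define qpw_unlock(lock, cpu)   local_unlock(lock)
#endif

With that mapping, flush_rcu_sheaf() takes the same lock the freeing
side takes, whether it runs on the target CPU (!PREEMPT_RT) or on a
housekeeping CPU (PREEMPT_RT), which is why the pcs->rcu_free = NULL
handoff above is race-free.
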
> How would this work with the "housekeeping on return to userspace" approach?
>
> - Would we just walk the list of all caches to flush them? could be
> expensive. Would we somehow note only those that need it? That would make
> the fast paths do something extra?
>
> - If some other CPU executed kmem_cache_destroy(), it would have to wait
> for the isolated cpu to return to userspace. Do we have the means for
> synchronizing on that? Would that risk a deadlock? We used to have a
> deferred finishing of the destroy for other reasons but were glad to get
> rid of it when it was possible; now it might be necessary to revive it?
I don't think you can expect system calls to return to userspace
within a bounded amount of time. A task could stay in kernel mode
for long periods of time.
> How would this work with QPW?
>
> - probably fast paths more expensive due to spin lock vs local_trylock_t
>
> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
>
> What if we avoid percpu sheaves completely on isolated cpus and instead
> allocate/free using the slowpaths?
>
> - It could probably be achieved without affecting fastpaths, as we already
> handle bootstrap without sheaves, so it's implemented in a way to not affect
> fastpaths.
>
> - Would it slow the isolcpu workloads down too much when they do a syscall?
> - compared to "houskeeping on return to userspace" flushing, maybe not?
> Because in that case the syscall starts with sheaves flushed from previous
> return, it has to do something expensive to get the initial sheaf, then
> maybe will use only on or few objects, then on return has to flush
> everything. Likely the slowpath might be faster, unless it allocates/frees
> many objects from the same cache.
> - compared to QPW - it would be slower as QPW would mostly retain sheaves
> populated, the need for flushes should be very rare
>
> So if we can assume that workloads on isolated cpus make syscalls only
> rarely, and when they do they can tolerate them being slower, I think the
> "avoid sheaves on isolated cpus" approach would be the best way here.
I am not sure it's safe to assume that. Ask Gemini about isolcpus use
cases and you get:
1. High-Frequency Trading (HFT)
In the world of HFT, microseconds are the difference between profit and loss.
Traders use isolcpus to pin their execution engines to specific cores.
The Goal: Eliminate "jitter" caused by the OS moving other processes onto the same core.
The Benefit: Guaranteed execution time and ultra-low latency.
2. Real-Time Audio & Video Processing
If you are running a Digital Audio Workstation (DAW) or a live video encoding rig, a tiny "hiccup" in CPU availability results in an audible pop or a dropped frame.
The Goal: Reserve cores specifically for the Digital Signal Processor (DSP) or the encoder.
The Benefit: Smooth, glitch-free media streams even when the rest of the system is busy.
3. Network Function Virtualization (NFV) & DPDK
For high-speed networking (like 10Gbps+ traffic), the Data Plane Development Kit (DPDK) uses "poll mode" drivers. These drivers constantly loop to check for new packets rather than waiting for interrupts.
The Goal: Isolate cores so they can run at 100% utilization just checking for network packets.
The Benefit: Maximum throughput and zero packet loss in high-traffic environments.
4. Gaming & Simulation
Competitive gamers or flight simulator enthusiasts sometimes isolate a few cores to handle the game's main thread, while leaving the rest of the OS (Discord, Chrome, etc.) to the remaining cores.
The Goal: Prevent background Windows/Linux tasks from stealing cycles from the game engine.
The Benefit: More consistent 1% low FPS and reduced input lag.
5. Deterministic Scientific Computing
If you're running a simulation that needs to take exactly the same amount of time every time it runs (for benchmarking or safety-critical testing), you can't have the OS interference messing with your metrics.
The Goal: Remove the variability of the Linux scheduler.
The Benefit: Highly repeatable, deterministic results.
===
For example, AF_XDP bypass uses system calls (and wants isolcpus):
https://www.quantvps.com/blog/kernel-bypass-in-hft?srsltid=AfmBOoryeSxuuZjzTJIC9O-Ag8x4gSwjs-V4Xukm2wQpGmwDJ6t4szuE