From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org,
shakeel.butt@linux.dev, usama.arif@linux.dev,
kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Wed, 8 Apr 2026 12:13:04 +0200 [thread overview]
Message-ID: <a9129d39-f9dc-4f09-b951-203c8e28b600@kernel.org> (raw)
In-Reply-To: <adUhtrhU7c1TJQww@gmail.com>
On 4/7/26 17:39, Breno Leitao wrote:
> On Thu, Apr 02, 2026 at 06:33:17AM -0700, Breno Leitao wrote:
>> > >
>> > > Cool!
>> > >
>> > > I noticed __round_jiffies_relative() exists and the description looks like
>> > > it's meant for exactly this use case?
>> >
>> > On closer look, using round_jiffies_relative() as before your patch
>> > means it's calling __round_jiffies_relative(j, raw_smp_processor_id()),
>> > so it's already doing this spread internally. You're also relying on
>> > smp_processor_id(), so it's not about using a different cpu id.
>> >
>> > But your patch has better results, so why? I still think it's not doing
>> > what it intends: I think it gives every cpu a different interval
>> > length (up to twice the original length), not a skew. Is it that, or is
>> > the 3-jiffy skew per cpu used in round_jiffies_common()
>> > insufficient? Or is it a bug in its skew implementation?
>> >
>> > Ideally once that's clear, the findings could be used to improve
>> > round_jiffies_common() and hopefully there's nothing here that's vmstat
>> > specific.
>>
>> Excellent observation. I believe there are two key differences:
>>
>> 1) The interval duration now varies per CPU. Specifically, with my
>>    proposed change vmstat_update() is scheduled at sysctl_stat_interval*2
>>    for the highest-numbered CPU, rather than at a uniform
>>    sysctl_stat_interval across all CPUs (as you raised in your first email).
>>
>> 2) round_jiffies_relative() applies only a 3-jiffy shift per CPU, whereas
>>    vmstat_spread_delay() distributes all CPUs across the full one-second
>>    interval. (My tests were on HZ=1000.)
>>
>> I'll investigate this further to provide more concrete data.
>
> After further investigation, I can confirm that both factors mentioned
> above contribute to the performance improvement.
>
> However, we certainly don't want scenario (1) where the delay varies per
> CPU, resulting in the last CPU having vmstat_update() scheduled every
> 2 seconds instead of 1 second.
Indeed.
> I've implemented a patch following Dmitry's suggestion, and the
> performance gains are measurable.
>
> Here's my testing methodology:
>
> 1) Use ftrace to measure the execution time of refresh_cpu_vm_stats()
> * Applied a custom instrumentation patch [1]
>
> 2) Execute stress-ng:
> * stress-ng --vm 72 --vm-bytes 11256M --vm-method all --timeout 60s ; cat /sys/kernel/debug/tracing/trace
>
> 3) Parse the output using a Python script [2]
>
> While the results are not as dramatic as initially reported (since
> approach (1) was effective but incorrect), the improvement is still
> substantial:
>
>
> ┌─────────┬────────────┬────────────┬───────┐
> │ Metric │ upstream* │ fix** │ Delta │
> ├─────────┼────────────┼────────────┼───────┤
> │ samples │ 36,981 │ 37,267 │ ~same │
> ├─────────┼────────────┼────────────┼───────┤
> │ avg │ 31,511 ns │ 21,337 ns │ -32% │
> ├─────────┼────────────┼────────────┼───────┤
> │ p50 │ 2,644 ns │ 2,925 ns │ ~same │
> ├─────────┼────────────┼────────────┼───────┤
> │ p99 │ 382,083 ns │ 304,357 ns │ -20% │
> ├─────────┼────────────┼────────────┼───────┤
> │ max │ 72.6 ms │ 16.0 ms │ -78% │
> └─────────┴────────────┴────────────┴───────┘
So you have 72 CPUs, the vmstat interval is 1s, and what's CONFIG_HZ?
If it's 1000, that's roughly 13 jiffies per cpu. Would changing the
round_jiffies_common() implementation to add 13 jiffies per cpu instead of 3
have the same effect?
>
> * Upstream is based on linux-next commit f3e6330d7fe42 ("Add linux-next specific files for 20260407")
> ** "fix" contains the patch below:
>
> Link: https://github.com/leitao/linux/commit/ac200164df1bda45ee8504cc3db5bff5b696245e [1]
> Link: https://github.com/leitao/linux/commit/baa2ea6ea4c4c2b1df689de6db0a2a6f119e51be [2]
>
>
> commit 41b7aaa1a51f07fc1f0db0614d140fbca78463d3
> Author: Breno Leitao <leitao@debian.org>
> Date: Tue Apr 7 07:56:35 2026 -0700
>
> mm/vmstat: spread per-cpu vmstat work to reduce zone->lock contention
>
> vmstat_shepherd() queues all per-cpu vmstat_update work with zero delay,
> and vmstat_update() re-queues itself with round_jiffies_relative(), which
> clusters timers near the same second boundary due to the small per-CPU
> spread in round_jiffies_common(). On many-CPU systems this causes
> thundering-herd contention on zone->lock when multiple CPUs
> simultaneously call refresh_cpu_vm_stats() -> decay_pcp_high() ->
> free_pcppages_bulk().
>
> Introduce vmstat_spread_delay() to assign each CPU a unique offset
> distributed evenly across sysctl_stat_interval. The shepherd uses this
> when initially queuing per-cpu work, and vmstat_update re-queues with a
> plain sysctl_stat_interval to preserve the spread (round_jiffies_relative
> would snap CPUs back to the same boundary).
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
I think this approach could have the following problems:
- the initially spread delays can drift over time; there's nothing keeping
  them in sync
- not using round_jiffies_relative() means firing at different times than
  other timers that do use the rounding, which could work against the
  power-saving effect of the rounding
- it's a vmstat-specific workaround for a yet-unclear underlying
  suboptimality that's likely not vmstat-specific
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3704f6ca7a268..8d93eee3b1f75 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2040,6 +2040,22 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu initial delay that spreads vmstat_update work evenly
> + * across the stat interval, so that CPUs do not all fire at the same
> + * second boundary.
> + */
> +static unsigned long vmstat_spread_delay(int cpu)
> +{
> + unsigned long interval = sysctl_stat_interval;
> + unsigned int nr_cpus = num_online_cpus();
> +
> + if (nr_cpus <= 1)
> + return 0;
> +
> + return (interval * (cpu % nr_cpus)) / nr_cpus;
> +}
> +
> static void vmstat_update(struct work_struct *w)
> {
> if (refresh_cpu_vm_stats(true)) {
> @@ -2047,10 +2063,13 @@ static void vmstat_update(struct work_struct *w)
> * Counters were updated so we expect more updates
> * to occur in the future. Keep on running the
> * update worker thread.
> + * Avoid round_jiffies_relative() here -- it would snap
> + * every CPU back to the same second boundary, undoing
> + * the initial spread from vmstat_shepherd.
> */
> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> this_cpu_ptr(&vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + sysctl_stat_interval);
> }
> }
>
> @@ -2148,7 +2167,8 @@ static void vmstat_shepherd(struct work_struct *w)
> continue;
>
> if (!delayed_work_pending(dw) && need_update(cpu))
> - queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> + queue_delayed_work_on(cpu, mm_percpu_wq, dw,
> + vmstat_spread_delay(cpu));
> }
>
> cond_resched();