From: "Michal Koutný" <mkoutny@suse.com>
To: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Tejun Heo <tj@kernel.org>, Shuah Khan <shuah@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Sean Christopherson <seanjc@google.com>,
James Houghton <jthoughton@google.com>,
Sebastian Chlad <sebastianchlad@gmail.com>,
Guopeng Zhang <zhangguopeng@kylinos.cn>,
Li Wang <liwan@redhat.com>, Li Wang <liwang@redhat.com>
Subject: Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
Date: Wed, 1 Apr 2026 20:41:49 +0200 [thread overview]
Message-ID: <n6mhkjsxsami3qmczkdh57eep4lmcgbtyl7ox3ajzveke44yf6@m4bjevvsr47k> (raw)
In-Reply-To: <20260320204241.1613861-2-longman@redhat.com>
Hello Waiman and Li.
On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long <longman@redhat.com> wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. It is because the
> threshold is now 64*128 = 8192. With 4k page size, it needs changes in
> 32 MB of memory. It will be even worse with larger page size like 64k.
>
> To make the output of memory.stat more correct, it is better to scale
> up the threshold slower than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise as suggested by Li Wang [1].
> An extra 2 is added to make sure that we will double the threshold for
> a 2-core system. The increase will be slower after that.
The explanation in [1] seems to just pick a function simply because log
seemed too slow.
(We should add a BPF hook to calculate the threshold. Haha, note the Date: header.)
The threshold has a twofold role: to bound the error and to preserve some
performance thanks to laziness, and these two pull against each other when
determining the threshold. The reasoning for linear scaling is that _each_
CPU contributes some updates, so linear scaling preserves the laziness,
whereas capping the error would hint at no dependency on nr_cpus at all.
My idea is that a job associated with a given memcg doesn't necessarily
run on _all_ CPUs of (such big) machines but effectively causes updates
on J CPUs. (Either the jobs are artificially constrained or they simply
are not-so-parallel.)
Hence the threshold should be based on that J and not on the actual nr_cpus.
Now the question is what the expected (CPU) size of a job is, and for that
I would consider a distribution like:
- 1 job of size nr_cpus, // you'd overcommit the machine with a bigger job
- 2 jobs of size nr_cpus/2,
- 3 jobs of size nr_cpus/3,
- ...
- nr_cpus jobs of size 1. // you'd underutilize the machine with fewer jobs
Note this is a quite naïve and arbitrary deliberation of mine, but it
results in something like a Pareto distribution, which is IMO quite
reasonable. With (only) that assumption, I can estimate the average job
size as
nr_cpus / (log(nr_cpus) + 1)
(the natural logarithm comes from the harmonic series, and the +1 comes
from the same approximation; it also comes in handy on UP)
log(x) ~ ilog2(x) * log(2) ~ ilog2(x) * 0.69
log(x) ~ 45426 * ilog2(x) / 65536
or
65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
with kernel functions:
var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
var3 = roundup_pow_of_two(var2)
I hope I don't need to present any more numbers at this moment because
the parameter derivation is backed by solid theory ;-) [*]
> With the int_sqrt() scale, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus() which may change at
> run time.
Hm, the inverted log turns this into a dilemma: whether to support CPU
hotplug or to keep threshold comparisons cheap. But it wouldn't be the
first place where static initialization with the possible count is used.
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between succesive runs
> can actually vary quite a bit. In fact, I have seen time lags of up
> to 10s of seconds in some cases. So we couldn't too rely on the hope
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
Yes, this sounds like a separate issue. I wouldn't mention it in this
commit unless you mean it's particularly related to the large nr_cpus.
> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>
> memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
> SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> + /*
> + * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
> + * 2 constant is to make sure that the threshold is double for a 2-core
> + * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> + * number of the CPUs reaches the next (2^n - 2) value.
Since you switched to sqrt, the comment should read n^2 - 2, not 2^n - 2.
> + */
> + vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> + (int_sqrt(num_possible_cpus() + 2));
>
> return 0;
> }
> --
> 2.53.0
(I will look at the rest of the series later. It looks interesting.)
[*]
nr_cpus    var1    var2    var3
      1       1       1       1
      2       1       2       2
      4       1       2       2
      8       2       3       4
     16       4       5       8
     32       7       8       8
     64      12      13      16
    128      21      22      32
    256      39      40      64
    512      70      71     128
   1024     129     130     256