From: Waiman Long <longman@redhat.com>
To: Li Wang <liwang@redhat.com>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Shakeel Butt" <shakeel.butt@linux.dev>,
"Muchun Song" <muchun.song@linux.dev>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
"Shuah Khan" <shuah@kernel.org>,
"Mike Rapoport" <rppt@kernel.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
"Sean Christopherson" <seanjc@google.com>,
"James Houghton" <jthoughton@google.com>,
"Sebastian Chlad" <sebastianchlad@gmail.com>,
"Guopeng Zhang" <zhangguopeng@kylinos.cn>,
"Li Wang" <liwan@redhat.com>
Subject: Re: [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(nums_possible_cpus)
Date: Fri, 20 Mar 2026 09:19:01 -0400 [thread overview]
Message-ID: <bf33746b-46ae-47ec-a735-d2a29226bf9c@redhat.com> (raw)
In-Reply-To: <ab0kAE7mJkEL9kWb@redhat.com>
On 3/20/26 6:40 AM, Li Wang wrote:
> On Thu, Mar 19, 2026 at 01:37:46PM -0400, Waiman Long wrote:
>> The vmstats flush threshold currently increases linearly with the
>> number of online CPUs. As the number of CPUs increases over time, it
>> will become increasingly difficult to meet the threshold and update the
>> vmstats data in a timely manner. These days, systems with hundreds of
>> CPUs or even thousands of them are becoming more common.
>>
>> For example, the test_memcg_sock test of test_memcontrol always fails
>> when running on an arm64 system with 128 CPUs. This is because the
>> threshold is now 64*128 = 8192 update events. With a 4k page size, that
>> corresponds to 32 MB of memory changes before a flush. It is even worse
>> with a larger page size like 64k.
>>
>> To make the output of memory.stat more accurate, it is better to
>> scale the threshold logarithmically instead of linearly with the
>> number of CPUs. With the log2 scale, we can use the possibly larger
>> num_possible_cpus() instead of num_online_cpus(), which may change at
>> run time.
>>
>> Although there is supposed to be a periodic and asynchronous flush of
>> vmstats every 2 seconds, the actual time lag between successive runs
>> can vary quite a bit. In fact, I have seen time lags of up to tens of
>> seconds in some cases. So we cannot rely too heavily on an asynchronous
>> vmstats flush happening every 2 seconds. This may be something we need
>> to look into.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> mm/memcontrol.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 772bac21d155..8d4ede72f05c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>> * rstat update tree grow unbounded.
>> *
>> * 2) Flush the stats synchronously on reader side only when there are more than
>> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
>> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
>> - * only for 2 seconds due to (1).
>> + * (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
>> + * optimization will let stats be out of sync by up to that amount but only
>> + * for 2 seconds due to (1).
>> */
>> static void flush_memcg_stats_dwork(struct work_struct *w);
>> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>> static u64 flush_last_time;
>> +static int vmstats_flush_threshold __ro_after_init;
>>
>> #define FLUSH_TIME (2UL*HZ)
>>
>> static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>> {
>> - return atomic_read(&vmstats->stats_updates) >
>> - MEMCG_CHARGE_BATCH * num_online_cpus();
>> + return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
>> }
>>
>> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
>> @@ -5191,6 +5191,13 @@ int __init mem_cgroup_init(void)
>>
>> memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>> SLAB_PANIC | SLAB_HWCACHE_ALIGN);
>> + /*
>> + * Logarithmically scale up vmstats flush threshold with the number
>> + * of CPUs.
>> + * N.B. ilog2(1) = 0.
>> + */
>> + vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
>> + (ilog2(num_possible_cpus()) + 1);
> Changing the threshold from linear to logarithmic scaling looks smarter,
> but my concern is that, on large systems (hundreds or thousands of CPUs),
> the threshold drops dramatically.
>
> For example, with 1024 CPUs it goes from 65536 (256MB) to only 704 (2.7MB),
> almost a 100x reduction. Could this raise a performance issue when
> 'memory.stat' is read frequently on a heavily loaded system?
>
> Maybe go with MEMCG_CHARGE_BATCH * int_sqrt(num_possible_cpus()),
> which sits between linear and log2?
I have also been thinking about scaling faster than log2 but still below
linear. I believe int_sqrt() is a good suggestion, and I will adopt it in
the next version.
Thanks,
Longman
Thread overview: 16+ messages
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(nums_possible_cpus) Waiman Long
2026-03-20 10:40 ` Li Wang
2026-03-20 13:19 ` Waiman Long [this message]
2026-03-19 17:37 ` [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-20 11:26 ` Li Wang
2026-03-20 13:20 ` Waiman Long
2026-03-19 17:37 ` [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-20 11:34 ` Li Wang
2026-03-19 17:37 ` [PATCH 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-19 17:37 ` [PATCH 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-19 17:37 ` [PATCH 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-19 17:37 ` [PATCH 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-20 2:43 ` [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
2026-03-20 15:56 ` Waiman Long
2026-03-20 20:26 ` Waiman Long