linux-mm.kvack.org archive mirror
From: Waiman Long <longman@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Shakeel Butt" <shakeel.butt@linux.dev>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	"Sean Christopherson" <seanjc@google.com>,
	"James Houghton" <jthoughton@google.com>,
	"Sebastian Chlad" <sebastianchlad@gmail.com>,
	"Guopeng Zhang" <zhangguopeng@kylinos.cn>,
	"Li Wang" <liwan@redhat.com>
Subject: Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
Date: Fri, 20 Mar 2026 11:56:03 -0400	[thread overview]
Message-ID: <cee91a5b-5b37-4e19-b0c9-eea985ab490b@redhat.com> (raw)
In-Reply-To: <20260319194347.1048fc8d737b6e8f9d82663d@linux-foundation.org>

On 3/19/26 10:43 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2026 13:37:45 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> There are a number of test failures with the running of the
>> test_memcontrol selftest on a 128-core arm64 system on kernels with
>> 4k/16k/64k page sizes. This patch series makes some minor changes to
>> the kernel and the test_memcontrol selftest to address these failures.
>>
>> The first kernel patch scales the memcg vmstats flush threshold
>> logarithmically instead of linearly with the total number of CPUs. The
>> second kernel patch scales down MEMCG_CHARGE_BATCH with increases in page
>> size. These 2 patches help to reduce the discrepancies between the
>> reported usage data and the real ones.
>>
>> The next 5 test_memcontrol selftest patches adjust the testing code to
>> greatly reduce the chance that it will report failure, though some
>> occasional failures are still possible.
>>
>> To verify the changes, the test_memcontrol selftest was run 100
>> times each on a 128-core arm64 system on kernels with 4k/16k/64k
>> page sizes.  No failure was observed other than some failures of the
>> test_memcg_reclaim test when running on a 16k page size kernel. The
>> reclaim_until() call failed because of the unexpected over-reclaim of
>> memory. This will need a further look, but it happens with the 16k page
>> size kernel only and I don't have a production-ready kernel config file
>> to use in building this 16k page size kernel. The new test_memcontrol
>> selftest and kernel were also run on a 96-core x86 system to make sure
>> there was no regression.
> AI reviewbot asks questions:
> 	https://sashiko.dev/#/patchset/20260319173752.1472864-1-longman%40redhat.com

>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>>   *    rstat update tree grow unbounded.
>>   *
>>   * 2) Flush the stats synchronously on reader side only when there are more than
>> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
>> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
>> - *    only for 2 seconds due to (1).
>> + *    (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
>> + *    optimization will let stats be out of sync by up to that amount but only
>> + *    for 2 seconds due to (1).
> Is this description accurate regarding the maximum out of sync amount?
> Looking at memcg_rstat_updated(), updates are buffered locally on each CPU
> up to MEMCG_CHARGE_BATCH - 1 before they are added to the global
> vmstats->stats_updates counter.
> Because memcg_vmstats_needs_flush() only checks the global counter, could
> N CPUs each buffer MEMCG_CHARGE_BATCH - 1 updates without triggering a
> synchronous flush?
> If so, wouldn't the actual worst-case out-of-sync error be
> N * (MEMCG_CHARGE_BATCH - 1) + vmstats_flush_threshold, which remains
> linear with the number of CPUs rather than scaling logarithmically?

Good point, the worst-case scenario can indeed be worse than that. I
will update the comment accordingly.

>> @@ -328,8 +328,14 @@ struct mem_cgroup {
>>   * size of first charge trial.
>>   * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
>>   * workload.
>> + *
>> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
>> + * amount of memory that can be hidden in each percpu memcg_stock for a given
>> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
>>   */
>> -#define MEMCG_CHARGE_BATCH 64U
>> +#define MEMCG_CHARGE_BATCH_BASE  64U
>> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
>> +#define MEMCG_CHARGE_BATCH     (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
> Will this cause false failures in the cgroup selftests on systems with
> larger page sizes?

Exactly. Li Wang has posted a patch [1] to address this issue. If that
patch is adopted, we will have to update the code again to reduce the
expected MAX_VMSTAT_ERROR accordingly.

[1] https://lore.kernel.org/lkml/20260306071843.149147-1-liwang@redhat.com/

> In tools/testing/selftests/cgroup/test_kmem.c, MAX_VMSTAT_ERROR is hardcoded
> assuming a 4KB page size and a 64-page charge batch (256KB per CPU):
> #define MAX_VMSTAT_ERROR (4096 * 64 * get_nprocs())
> ...
>     if (labs(sum - current) < MAX_VMSTAT_ERROR) {
> ...
> With this change, a 64KB page system will have a batch size of 16. The
> theoretical per-CPU error becomes 16 pages * 64KB = 1024KB per CPU, which
> is 4 times larger than the hardcoded MAX_VMSTAT_ERROR.
> Does MAX_VMSTAT_ERROR in test_kmem.c need to be updated to compute the error
> using the runtime system page size and the adjusted batch limit?

>> diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> index 77f386dab5e8..c25228a78b8b 100644
>> --- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> +++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> @@ -6,6 +6,7 @@
>>  #define PAGE_SIZE 4096
>>  #endif
>>
>> +#define KB(x) (x << 10)
> This isn't a bug, but could this macro definition risk operator precedence
> issues if invoked with an expression? Should it be defined with inner
> parentheses around the argument, like #define KB(x) ((x) << 10)?

So far, only a single integer literal is used for x, so the current macro
is OK as-is. I do agree that adding parentheses around the argument will
be safer for future use cases.

>> @@ -1181,7 +1199,7 @@ static int test_memcg_swap_max_peak(const char *root)
>>      if (cg_read_long(memcg, "memory.peak") < MB(29))
>>          goto cleanup;
>>
>> -    if (cg_read_long(memcg, "memory.swap.peak") < MB(29))
>> +    if (cg_read_long(memcg, "memory.swap.peak") < swap_peak)
>>          goto cleanup;
>>
>>      if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
>>          goto cleanup;
> Does leaving memory.high set to 29M alter the semantics of the subsequent
> alloc_anon_50M_check_swap test?
> When alloc_anon_50M_check_swap runs, it allocates 50MB and expects
> memory.current to be close to the 30MB memory.max limit.
> Because memory.high is still set to 29MB from the earlier operation, the 50MB
> allocation gets throttled at 29MB and enters synchronous reclaim before
> reaching the 30MB memory.max limit.
> Could this inadvertently change the second test from verifying swap behavior
> under a hard limit to verifying it under a soft limit?

The purpose of setting memory.high to 29M is to slow down the memory
allocation process so that the swapping code can swap out more anonymous
memory before the OOM killer comes in and kills the process. Otherwise,
the actual swap-out value will be even lower with larger page sizes. I
can drop the memory.high setting and lower the threshold further to
avoid spurious failures.

>> @@ -1477,12 +1477,20 @@ static int test_memcg_sock(const char *root)
>>       * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
>>       * scheduling slack) and require that the "sock " counter
>>       * eventually drops to zero.
>> +     *
>> +     * The actual run-to-run elapse time between consecutive run
>> +     * of asynchronous memcg rstat flush may varies quite a bit.
>> +     * So the 3 seconds wait time may not be enough for the "sock"
>> +     * counter to go down to 0. Treat it as a XFAIL instead of
>> +     * a FAIL.
>>       */
>>      sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
>>                       MEMCG_SOCKSTAT_WAIT_RETRIES,
>>                       DEFAULT_WAIT_INTERVAL_US);
>> -    if (sock_post)
>> +    if (sock_post) {
>> +        ret = KSFT_XFAIL;
>>          goto cleanup;
>> +    }
> Does this code inadvertently mask actual system or I/O errors?
> If cg_read_key_long_poll() returns -1 because it failed to read the file or
> the key is missing, this check will treat it as an expected timeout failure
> rather than an actual test failure.

I will add a positive value check before setting KSFT_XFAIL.

> Does marking this condition as KSFT_XFAIL prevent the test from catching
> genuine socket memory leaks?
> If a kernel regression causes socket memory to actually leak, the test will
> time out and report an expected failure, which CI systems might ignore.
> Would it be more robust to increase the polling timeout to accommodate the
> maximum latency observed, or manually trigger a synchronous flush, instead
> of masking the timeout?

We may have to increase the timeout excessively to allow for the possible
variations in the asynchronous vmstats flush delay, which could make the
test take too long to run. In my own testing, the current code fails
rather frequently without this change.

I do agree that we should look into this issue; the expected failure can
be removed once it is fixed.

Cheers,
Longman



Thread overview: 16+ messages
2026-03-19 17:37 Waiman Long
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(nums_possible_cpus) Waiman Long
2026-03-20 10:40   ` Li Wang
2026-03-20 13:19     ` Waiman Long
2026-03-19 17:37 ` [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-20 11:26   ` Li Wang
2026-03-20 13:20     ` Waiman Long
2026-03-19 17:37 ` [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-20 11:34   ` Li Wang
2026-03-19 17:37 ` [PATCH 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-19 17:37 ` [PATCH 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-19 17:37 ` [PATCH 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-19 17:37 ` [PATCH 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-20  2:43 ` [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
2026-03-20 15:56   ` Waiman Long [this message]
2026-03-20 20:26     ` Waiman Long
