* selftests: cgroup: Failures – Timeouts & OOM Issues Analysis
@ 2025-03-04 11:56 Naresh Kamboju
From: Naresh Kamboju @ 2025-03-04 11:56 UTC (permalink / raw)
To: Cgroups, linux-mm, open list:KERNEL SELFTEST FRAMEWORK, open list
Cc: Shuah Khan, Dan Carpenter, Arnd Bergmann, Anders Roxell,
Tejun Heo, Johannes Weiner, Michal Koutný
As part of LKFT’s re-validation of known issues, we have observed that
the selftests: cgroup suite is consistently failing across almost all
LKFT-supported devices due to:
- Test timeouts (45-second limit reached)
- OOM-killer invocation
## Key Questions for Discussion:
- Would it be beneficial to increase the test timeout to ~180 seconds
to allow sufficient execution time? (a possible settings-file sketch
follows the note below)
- Should we enhance logging to explicitly print failure reasons when a
test fails?
- Are there any missing dependencies that could be causing these failures?
Note: The required selftests/cgroup/config options were included in
LKFT's build and test plans.
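If the limit does need raising, one possible mechanism is the kselftest
runner's per-suite "settings" override (a sketch only, following the
convention other suites such as net already use; the file would need to be
added if the cgroup suite doesn't carry one):

  # tools/testing/selftests/cgroup/settings
  timeout=180

Without such a file the runner falls back to its 45-second default.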
## Devices Affected:
The following DUTs consistently experience these failures:
- dragonboard-410c (arm64)
- dragonboard-845c (arm64)
- e850-96 (arm64)
- juno-r2 (arm64)
- qemu-arm64 (arm64)
- qemu-armv7
- qemu-x86_64
- rk3399-rock-pi-4b (arm64)
- x15 (arm)
- x86_64
## Regression Analysis:
- New regression? No (these failures have been observed for months/years).
- Reproducibility? Yes, the failures occur consistently.
- Test suite affected? selftests: cgroup (timeouts and OOM-related failures).
Test regression: selftests: cgroup fails with timeouts and oom-killer invocations
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
## Test log:
# selftests: cgroup: test_cpu
# ok 1 test_cpucg_subtree_control
# ok 2 test_cpucg_stats
# ok 3 test_cpucg_nice
# not ok 4 test_cpucg_weight_overprovisioned
# ok 5 test_cpucg_weight_underprovisioned
# ok 6 test_cpucg_nested_weight_overprovisioned
# ok 7 test_cpucg_nested_weight_underprovisioned
#
not ok 2 selftests: cgroup: test_cpu # TIMEOUT 45 seconds
<trim>
# selftests: cgroup: test_freezer
# ok 1 test_cgfreezer_simple
# ok 2 test_cgfreezer_tree
# ok 3 test_cgfreezer_forkbomb
# ok 4 test_cgfreezer_mkdir
# ok 5 test_cgfreezer_rmdir
# ok 6 test_cgfreezer_migrate
# Cgroup /sys/fs/cgroup/cg_test_ptrace isn't frozen
# not ok 7 test_cgfreezer_ptrace
# ok 8 test_cgfreezer_stopped
# ok 9 test_cgfreezer_ptraced
# ok 10 test_cgfreezer_vfork
not ok 4 selftests: cgroup: test_freezer # exit=1
<trim>
# selftests: cgroup: test_kmem
#
not ok 7 selftests: cgroup: test_kmem # TIMEOUT 45 seconds
<trim>
# selftests: cgroup: test_memcontrol
# ok 1 test_memcg_subtree_control
# not ok 2 test_memcg_current_peak
# not ok 3 test_memcg_min
# not ok 4 test_memcg_low
# not ok 5 test_memcg_high
# ok 6 test_memcg_high_sync
[ 270.699078] test_memcontrol invoked oom-killer:
gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 270.699921] CPU: 1 UID: 0 PID: 946 Comm: test_memcontrol Not
tainted 6.14.0-rc5-next-20250303 #1
[ 270.699930] Hardware name: Radxa ROCK Pi 4B (DT)
<trim>
[ 270.729527] Memory cgroup out of memory: Killed process 946
(test_memcontrol) total-vm:104840kB, anon-rss:30596kB,
file-rss:1056kB, shmem-rss:0kB, UID:0 pgtables:104kB oom_score_adj:0
# not ok 7 test_memcg_max
# not ok 8 test_memcg_reclaim
<trim>
not ok 8 selftests: cgroup: test_memcontrol # exit=1
## Source
* Kernel version: 6.14.0-rc5-next-20250303
* Git tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: cd3215bbcb9d4321def93fea6cfad4d5b42b9d1d
* Git describe: 6.14.0-rc5-next-20250303
* Project details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/
## Test data
* Test log: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/testrun/27482450/suite/kselftest-cgroup/test/cgroup_test_memcontrol/log
* Test history:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/testrun/27482450/suite/kselftest-cgroup/test/cgroup_test_memcontrol/history/
* Test details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/testrun/27482450/suite/kselftest-cgroup/test/cgroup_test_memcontrol/
* Test logs rock pi:
https://lkft.validation.linaro.org/scheduler/job/8148789#L1774
* Test logs x86: https://lkft.validation.linaro.org/scheduler/job/8148731#L1948
--
Linaro LKFT
https://lkft.linaro.org
* Re: selftests: cgroup: Failures – Timeouts & OOM Issues Analysis
From: Michal Koutný @ 2025-03-04 14:20 UTC (permalink / raw)
To: Naresh Kamboju
Cc: Cgroups, linux-mm, open list:KERNEL SELFTEST FRAMEWORK,
open list, Shuah Khan, Dan Carpenter, Arnd Bergmann,
Anders Roxell, Tejun Heo, Johannes Weiner
Hello Naresh.
On Tue, Mar 04, 2025 at 05:26:45PM +0530, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> As part of LKFT’s re-validation of known issues, we have observed that
> the selftests: cgroup suite is consistently failing across almost all
> LKFT-supported devices due to:
> - Test timeouts (45 seconds limit reached)
> - OOM-killer invocation
Thanks for reporting the issues with the tests.
> ## Key Questions for Discussion:
> - Would it be beneficial to increase the test timeout to ~180 seconds
> to allow sufficient execution time?
That depends.
test_cpu has some lengthier checks and in sum they can surpass 45s;
it might be better to shorten them (within the precision margin) instead
of prolonging the limit.
test_kmem -- it shouldn't take that long; if anything, I'd suspect
/proc/kpagecgroup -- are your systems larger than 100GiB of memory
(that's my rough estimate for these reads to take longer than the limit)?
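A back-of-envelope behind that estimate (assuming 4 KiB pages and that the
test scans the whole file; /proc/kpagecgroup exposes one 8-byte entry per
page frame):

  100 GiB RAM / 4 KiB per page = 26,214,400 page frames
  26,214,400 frames * 8 B      ≈ 200 MiB to read, with one cgroup ino
                                 lookup per frame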
(Are there any other timeouts?)
OOM -- some tests are supposed to trigger memcg OOM.
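For triage it may help to distinguish a memcg OOM from a global one in the
kernel log: memcg kills are reported as "Memory cgroup out of memory: ..."
(as in the rock-pi log quoted in the report), global kills as
"Out of memory: ...". A quick check could be:

  dmesg | grep -E '(Memory cgroup out of memory|Out of memory):'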
> - Should we enhance logging to explicitly print failure reasons when a
> test fails?
These tests are useful when run by developers them_selves_. In such a
case it's handy to obtain more info by running them under strace (since
they're so simple).
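Something along these lines (paths illustrative, run from the built
selftests directory):

  cd tools/testing/selftests/cgroup
  strace -f -o /tmp/test_memcontrol.strace ./test_memcontrol
  # the writes to the cgroup control files and their errno values
  # usually point at the check that failed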
> - Are there any missing dependencies that could be causing these failures?
> Note: The required selftests/cgroup/config options were included in
> LKFT's build and test plans.
The deps are rather minimal, only some coreutils (cgroup selftests
should be covered by e.g. this list [1]).
>
> ## Devices Affected:
> The following DUTs consistently experience these failures:
> - dragonboard-410c (arm64)
> - dragonboard-845c (arm64)
> - e850-96 (arm64)
> - juno-r2 (arm64)
> - qemu-arm64 (arm64)
> - qemu-armv7
> - qemu-x86_64
> - rk3399-rock-pi-4b (arm64)
> - x15 (arm)
> - x86_64
>
> Regression Analysis:
> - New regression? No (these failures have been observed for months/years).
Actually, I noticed a test_memcontrol failure yesterday (with a ~mainline
kernel), but I remember it used to work rather recently. I haven't had
time to look into that, but at least that one may be a regression (in
code or test).
> - Reproducibility? Yes, the failures occur consistently.
+/- as that may depend on nr_cpus or totalram.
> - Test suite affected? selftests: cgroup (timeouts and OOM-related failures).
Michal
* Re: selftests: cgroup: Failures – Timeouts & OOM Issues Analysis
From: Michal Koutný @ 2025-04-14 13:46 UTC (permalink / raw)
To: Naresh Kamboju
Cc: Cgroups, linux-mm, open list:KERNEL SELFTEST FRAMEWORK, open list
-Cc: non-lists
On Tue, Mar 04, 2025 at 03:20:58PM +0100, Michal Koutný <mkoutny@suse.com> wrote:
> Actually, I noticed test_memcontrol failure yesterday (with ~mainline
> kernel) but I remember they used to work also rather recently. I haven't
> got time to look into that but at least that one may be a regression (in
> code or test).
So I'm seeing (with v6.15-rc1):
| not ok 1 test_kmem_basic
| ok 2 test_kmem_memcg_deletion
| ok 3 test_kmem_proc_kpagecgroup
| ok 4 test_kmem_kernel_stacks
| not ok 5 test_kmem_dead_cgroups
| memory.current 8130560 [ <- 1 vCPU ] 13168640
| percpu 5040000 [ 4 vCPUs ->] 10080000
| not ok 6 test_percpu_basic
not ok 1
From a quick look I suspect that the negative dentries used to boost
memory consumption aren't enough any more (since some kernel changes; the
test assumes at least 10B/dentry) -- presumably the test is inappropriate
in the new dentry environment, not a memcg bug proper.
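(For context, negative dentries are created simply by looking up paths
that don't exist; a minimal shell sketch of the idea the test relies on,
not the test's actual code:)

  mkdir -p /tmp/dcache_boost && cd /tmp/dcache_boost
  for i in $(seq 1 100000); do stat "no_such_file_$i" >/dev/null 2>&1; done
  # each failed lookup leaves a negative dentry, charged to the caller's
  # memcg when kernel memory accounting is enabled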
not ok 5
A dying memcg is pinned by something indefinitely; I didn't look deeper
into that. A little suspicious.
not ok 6
That looks like the test doesn't take into account the non-percpu
allocations of a memcg (e.g. struct mem_cgroup alone is ~2KiB, plus struct
mem_cgroup_per_node). The test needs better boundaries; it's not a memcg bug.
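A back-of-envelope from the numbers quoted above (assuming the two columns
are otherwise comparable runs):

  1 vCPU : current - percpu = 8130560  - 5040000   ≈ 2.9 MiB non-percpu
  4 vCPUs: current - percpu = 13168640 - 10080000  ≈ 2.9 MiB non-percpu

i.e. the non-percpu remainder stays roughly constant while the percpu part
scales with nr_cpus, so any fixed ratio between percpu and memory.current
behaves differently on small CPU counts.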
HTH,
Michal