linux-mm.kvack.org archive mirror
* selftests: cgroup: Failures – Timeouts & OOM Issues Analysis
@ 2025-03-04 11:56 Naresh Kamboju
  2025-03-04 14:20 ` Michal Koutný
  0 siblings, 1 reply; 3+ messages in thread
From: Naresh Kamboju @ 2025-03-04 11:56 UTC (permalink / raw)
  To: Cgroups, linux-mm, open list:KERNEL SELFTEST FRAMEWORK, open list
  Cc: Shuah Khan, Dan Carpenter, Arnd Bergmann, Anders Roxell,
	Tejun Heo, Johannes Weiner, Michal Koutný

As part of LKFT’s re-validation of known issues, we have observed that
the selftests: cgroup suite is consistently failing across almost all
LKFT-supported devices due to:
 - Test timeouts (45 seconds limit reached)
 - OOM-killer invocation

## Key Questions for Discussion:
 - Would it be beneficial to increase the test timeout to ~180 seconds
   to allow sufficient execution time?
 - Should we enhance logging to explicitly print failure reasons when a
   test fails?
 - Are there any missing dependencies that could be causing these failures?
     Note: The required selftests/cgroup/config options were included in
     LKFT's build and test plans.
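For context on the first question: the kselftest runner's default per-suite limit can be overridden with a `settings` file in the test directory; a minimal sketch (the 180 value is the figure proposed above, and a scratch directory stands in for tools/testing/selftests/cgroup):

```shell
# Sketch: the kselftest harness reads a per-suite "settings" file, and a
# "timeout=" line there replaces the 45 s default for that directory.
# A scratch dir stands in for tools/testing/selftests/cgroup here.
dir=$(mktemp -d)
printf 'timeout=180\n' > "$dir/settings"
line=$(grep '^timeout=' "$dir/settings")   # the line the runner parses
echo "$line"
rm -r "$dir"
```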

## Devices Affected:
The following DUTs consistently experience these failures:
  -  dragonboard-410c (arm64)
  -  dragonboard-845c (arm64)
  -  e850-96 (arm64)
  -  juno-r2 (arm64)
  -  qemu-arm64 (arm64)
  -  qemu-armv7
  -  qemu-x86_64
  -  rk3399-rock-pi-4b (arm64)
  -  x15 (arm)
  -  x86_64

## Regression Analysis:
 - New regression? No (these failures have been observed for months/years).
 - Reproducibility? Yes, the failures occur consistently.
 - Test suite affected? selftests: cgroup (timeouts and OOM-related failures).

Test regression: selftests cgroup fails timeout and oom-killer
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

## Test log:
# selftests: cgroup: test_cpu
# ok 1 test_cpucg_subtree_control
# ok 2 test_cpucg_stats
# ok 3 test_cpucg_nice
# not ok 4 test_cpucg_weight_overprovisioned
# ok 5 test_cpucg_weight_underprovisioned
# ok 6 test_cpucg_nested_weight_overprovisioned
# ok 7 test_cpucg_nested_weight_underprovisioned
#
not ok 2 selftests: cgroup: test_cpu # TIMEOUT 45 seconds

<trim>
# selftests: cgroup: test_freezer
# ok 1 test_cgfreezer_simple
# ok 2 test_cgfreezer_tree
# ok 3 test_cgfreezer_forkbomb
# ok 4 test_cgfreezer_mkdir
# ok 5 test_cgfreezer_rmdir
# ok 6 test_cgfreezer_migrate
# Cgroup /sys/fs/cgroup/cg_test_ptrace isn't frozen
# not ok 7 test_cgfreezer_ptrace
# ok 8 test_cgfreezer_stopped
# ok 9 test_cgfreezer_ptraced
# ok 10 test_cgfreezer_vfork
not ok 4 selftests: cgroup: test_freezer # exit=1
<trim>

# selftests: cgroup: test_kmem
#
not ok 7 selftests: cgroup: test_kmem # TIMEOUT 45 seconds

<trim>

# selftests: cgroup: test_memcontrol
# ok 1 test_memcg_subtree_control
# not ok 2 test_memcg_current_peak
# not ok 3 test_memcg_min
# not ok 4 test_memcg_low
# not ok 5 test_memcg_high
# ok 6 test_memcg_high_sync
[  270.699078] test_memcontrol invoked oom-killer:
gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[  270.699921] CPU: 1 UID: 0 PID: 946 Comm: test_memcontrol Not
tainted 6.14.0-rc5-next-20250303 #1
[  270.699930] Hardware name: Radxa ROCK Pi 4B (DT)

<trim>
[ 270.729527] Memory cgroup out of memory: Killed process 946
(test_memcontrol) total-vm:104840kB, anon-rss:30596kB,
file-rss:1056kB, shmem-rss:0kB, UID:0 pgtables:104kB oom_score_adj:0
# not ok 7 test_memcg_max
# not ok 8 test_memcg_reclaim
<trim>
not ok 8 selftests: cgroup: test_memcontrol # exit=1

## Source
* Kernel version: 6.14.0-rc5-next-20250303
* Git tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: cd3215bbcb9d4321def93fea6cfad4d5b42b9d1d
* Git describe: 6.14.0-rc5-next-20250303
* Project details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/

## Test data
* Test log: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/testrun/27482450/suite/kselftest-cgroup/test/cgroup_test_memcontrol/log
* Test history:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/testrun/27482450/suite/kselftest-cgroup/test/cgroup_test_memcontrol/history/
* Test details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/testrun/27482450/suite/kselftest-cgroup/test/cgroup_test_memcontrol/
* Test logs rock pi:
https://lkft.validation.linaro.org/scheduler/job/8148789#L1774
* Test logs x86:  https://lkft.validation.linaro.org/scheduler/job/8148731#L1948

--
Linaro LKFT
https://lkft.linaro.org



* Re: selftests: cgroup: Failures – Timeouts & OOM Issues Analysis
  2025-03-04 11:56 selftests: cgroup: Failures – Timeouts & OOM Issues Analysis Naresh Kamboju
@ 2025-03-04 14:20 ` Michal Koutný
  2025-04-14 13:46   ` Michal Koutný
  0 siblings, 1 reply; 3+ messages in thread
From: Michal Koutný @ 2025-03-04 14:20 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Cgroups, linux-mm, open list:KERNEL SELFTEST FRAMEWORK,
	open list, Shuah Khan, Dan Carpenter, Arnd Bergmann,
	Anders Roxell, Tejun Heo, Johannes Weiner


Hello Naresh.

On Tue, Mar 04, 2025 at 05:26:45PM +0530, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> As part of LKFT’s re-validation of known issues, we have observed that
> the selftests: cgroup suite is consistently failing across almost all
> LKFT-supported devices due to:
>  - Test timeouts (45 seconds limit reached)
>  - OOM-killer invocation

Thanks for reporting the issues with the tests.

> ## Key Questions for Discussion:
>  - Would it be beneficial to increase the test timeout to ~180 seconds
>    to allow sufficient execution time?

That depends.

test_cpu has some lengthier checks, and in sum they can surpass 45 s;
it might be better to shorten them (within their precision margin)
instead of prolonging the limit.

test_kmem -- it shouldn't take so long; if anything, I'd suspect
/proc/kpagecgroup -- do your systems have more than 100 GiB of memory
(that's my rough estimate for these reads to exceed the limit)?

(Are there any other timeouts?)

OOM -- some tests are supposed to trigger memcg OOM.

>  - Should we enhance logging to explicitly print failure reasons when a
>    test fails?

These tests are most useful when run by developers them_selves_. In such
a case it's handy to obtain more info by running them under strace
(since they're so simple).
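A sketch of that workflow (the binary name is illustrative; `true` stands in for the real test so the snippet is self-contained, and a ptrace pre-check keeps it working where strace is unavailable):

```shell
# Sketch: rerun one failing selftest binary under strace to capture the
# failure reason the TAP output omits. "true" stands in for the real
# binary (e.g. ./test_kmem from the cgroup selftest directory).
target=true                      # replace with the failing test binary
if command -v strace >/dev/null 2>&1 && strace -o /dev/null true 2>/dev/null; then
    strace -f -o /tmp/cgroup_test.strace "$target"; rc=$?
else
    "$target"; rc=$?             # strace unusable here; run it bare
fi
echo "exit=$rc"
```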

>  - Are there any missing dependencies that could be causing these failures?
>      Note: The required selftests/cgroup/config options were included in
>      LKFT's build and test plans.

The deps are rather minimal, only some coreutils (cgroup selftests
should be covered by e.g. this list [1]).



> 
> ## Devices Affected:
> The following DUTs consistently experience these failures:
>   -  dragonboard-410c (arm64)
>   -  dragonboard-845c (arm64)
>   -  e850-96 (arm64)
>   -  juno-r2 (arm64)
>   -  qemu-arm64 (arm64)
>   -  qemu-armv7
>   -  qemu-x86_64
>   -  rk3399-rock-pi-4b (arm64)
>   -  x15 (arm)
>   -  x86_64
> 
> Regression Analysis:
>  - New regression? No (these failures have been observed for months/years).

Actually, I noticed a test_memcontrol failure yesterday (with a
~mainline kernel), but I remember it also working rather recently. I
haven't had time to look into that, but at least that one may be a
regression (in the code or the test).

>  - Reproducibility? Yes, the failures occur consistently.

+/-, as that may depend on nr_cpus or totalram.

>  - Test suite affected? selftests: cgroup (timeouts and OOM-related failures).


Michal



* Re: selftests: cgroup: Failures – Timeouts & OOM Issues Analysis
  2025-03-04 14:20 ` Michal Koutný
@ 2025-04-14 13:46   ` Michal Koutný
  0 siblings, 0 replies; 3+ messages in thread
From: Michal Koutný @ 2025-04-14 13:46 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Cgroups, linux-mm, open list:KERNEL SELFTEST FRAMEWORK, open list


-Cc: non-lists

On Tue, Mar 04, 2025 at 03:20:58PM +0100, Michal Koutný <mkoutny@suse.com> wrote:
> Actually, I noticed test_memcontrol failure yesterday (with ~mainline
> kernel) but I remember they used to work also rather recently. I haven't
> got time to look into that but at least that one may be a regression (in
> code or test).

So I'm seeing (with v6.15-rc1):

| not ok 1 test_kmem_basic
| ok 2 test_kmem_memcg_deletion
| ok 3 test_kmem_proc_kpagecgroup
| ok 4 test_kmem_kernel_stacks
| not ok 5 test_kmem_dead_cgroups
| memory.current 8130560 [ <- 1 vCPU ] 13168640
| percpu         5040000 [ 4 vCPUs ->] 10080000
| not ok 6 test_percpu_basic

not ok 1
At a quick look, I suspect that the negative dentries used to boost
memory consumption aren't enough (since some kernel changes; the test
assumes at least 10B/dentry) -- presumably the test is inappropriate in
the new dentry environment, not a memcg bug proper.
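For reference, the boosting mechanism in question works roughly like this (a self-contained sketch; the loop count is arbitrary, and the real test charges these lookups to a memcg rather than a scratch directory):

```shell
# Sketch: stat()-ing paths that don't exist leaves negative dentries
# behind; test_kmem_basic relies on many such lookups to inflate the
# kernel memory charged to a cgroup.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    stat "$dir/missing_$i" >/dev/null 2>&1   # ENOENT -> negative dentry
    i=$((i + 1))
done
echo "issued $i lookups"
rm -r "$dir"
```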


not ok 5
A dying memcg pinned by something indefinitely, didn't look deeper into
that. Little suspicious.

not ok 6
That looks like the test doesn't take into account non-percpu
allocations of memcg (e.g. struct memcg alone is a ~2KiB + struct
mem_cgroup_per_node). The test needs better boundaries, not a memcg bug.

HTH,
Michal


