[Question]: pagecache thrashing and hard to trigger OOM in cgroup

* [Question]: pagecache thrashing and hard to trigger OOM in cgroup
@ 2023-11-22  3:26 Liu Shixin
  2025-09-23  8:23 ` Zhu Haoran
  0 siblings, 1 reply; 2+ messages in thread
From: Liu Shixin @ 2023-11-22  3:26 UTC
  To: Johannes Weiner, Andrew Morton, Michal Hocko, Shakeel Butt,
	Roman Gushchin, Muchun Song
  Cc: linux-mm, cgroups, Nanyong Sun, Kefeng Wang

Hi everyone,

Recently we hit an IO performance issue caused by pagecache thrashing in
a cgroup, and we found that it was introduced by commit 815744d75152 ("mm: memcontrol:
don't batch updates of local VM stats and events").

The problem is easy to reproduce in a docker environment. First, create a container
with a 4G memory limit and a 2G swap limit, then run a program that allocates (6G - 50M)
of anon memory, so that only 50M of memory remains usable and no swap space is left. Then
run "yum install gcc". We observed that the yum process thrashes and IO
stays high for a long time without triggering OOM. This affects other processes and
containers on the machine.
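
The memory budget in the steps above can be checked with a short calculation.
This is only a sketch: the function name is hypothetical, and the docker flags
mentioned in the comment are an assumption about how such limits are usually
set, not something stated in the report.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def remaining_headroom(mem_limit, swap_limit, anon_bytes):
    """Bytes left in the cgroup once anon memory has filled RAM and swap.

    Assumption: limits like these are typically set with something like
    `docker run --memory=4g --memory-swap=6g`, where --memory-swap is the
    combined memory + swap total. The report does not say which flags were used.
    """
    total = mem_limit + swap_limit
    return max(total - anon_bytes, 0)


# Values from the reproduction steps: 4G memory, 2G swap, (6G - 50M) anon.
headroom = remaining_headroom(4 * GIB, 2 * GIB, 6 * GIB - 50 * MIB)
print(headroom // MIB, "MiB left for page cache")
```

With only about 50M left, every file read by yum has to compete for the same
small pool of pagecache.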

After analysis, we found a large number of readahead failures during this period.
Since page allocations for pagecache readahead carry the __GFP_NORETRY flag, the OOM
killer is skipped when the memcg limit is reached. The pagecache is repeatedly allocated and
reclaimed, and the value of workingset_refault_file is high. These readaheads take a lot of
time, consume a lot of IO throughput, and impact the entire system. This continues
for a long time, until some other page allocation triggers OOM.
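
One way to watch the refault counter described above is to sample the cgroup's
memory.stat twice and compare. The sketch below is an illustration, not the
tooling used for this report: the helper names are hypothetical, the snapshot
values are made up, and the path in the comment assumes a cgroup v2 hierarchy.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines from a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def refault_rate(before, after, interval_s, key="workingset_refault_file"):
    """Refaults per second between two parsed snapshots."""
    return (after.get(key, 0) - before.get(key, 0)) / interval_s


# Hypothetical snapshots taken 5 seconds apart. On a real system the text
# would come from a file such as /sys/fs/cgroup/<group>/memory.stat
# (assumed cgroup v2 layout; <group> is a placeholder).
snap1 = parse_memory_stat("file 52428800\nworkingset_refault_file 1000\n")
snap2 = parse_memory_stat("file 52428800\nworkingset_refault_file 6000\n")
print(refault_rate(snap1, snap2, 5.0), "file refaults/s")
```

A rate that stays high while the "file" total stays flat is consistent with
the same pages being reclaimed and read back in repeatedly.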

By bisection, we finally arrived at commit 815744d75152 ("mm: memcontrol: don't batch
updates of local VM stats and events"). Before that commit, the process triggers OOM
within a very short time. We suspect the difference is caused by performance changes.

Is there a good way to fix this problem? We would prefer that the process be OOM-killed
rather than have the system hang and affect other processes.

Thanks,

* Re: [Question]: pagecache thrashing and hard to trigger OOM in cgroup
  2023-11-22  3:26 [Question]: pagecache thrashing and hard to trigger OOM in cgroup Liu Shixin
@ 2025-09-23  8:23 ` Zhu Haoran
  0 siblings, 0 replies; 2+ messages in thread
From: Zhu Haoran @ 2025-09-23  8:23 UTC
  To: liushixin2
  Cc: akpm, cgroups, guro, hannes, linux-mm, mhocko, muchun.song,
	shakeelb, sunnanyong, wangkefeng.wang

Hello Liu Shixin,

I’ve been trying to reproduce the thrashing issue you reported. Using QEMU with
the script in [1], the memory-hogging process was always killed quickly, within 1-2
minutes, with or without the patch. However, on a physical machine running a 6.8
kernel (without your patch [1]), I was able to reproduce the issue and observe
the installer thrashing for a long time. I’m now trying to understand why this
difference occurs.

> By bisection, we finally found commit 815744d75152("mm:  memcontrol: don't
> batch updates of local VM stats and events"). Before the commit, the process
> will trigger oom in very short time. We suspect the difference is caused by
> performance changes.

Do you have any insights on why this commit affects OOM triggering?

[1] https://lkml.org/lkml/2024/3/22/410

---
Thanks,
Zhu Haoran
