* [PATCH v3] memcg: introduce non-blocking limit setting option
@ 2025-05-06 23:28 Shakeel Butt
2025-05-07 0:53 ` Andrew Morton
2025-05-07 7:30 ` Michal Hocko
From: Shakeel Butt @ 2025-05-06 23:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
linux-mm, cgroups, linux-kernel, Meta kernel team, Greg Thelen,
Michal Koutný,
Tejun Heo, Yosry Ahmed, Christian Brauner
Setting the max and high limits can trigger synchronous reclaim and/or
oom-kill if the usage is higher than the given limit. This behavior is
fine for newly created cgroups but it can cause issues for the node
controller while setting limits for existing cgroups.
In our production multi-tenant and overcommitted environment, we are
seeing priority inversion when the node controller dynamically adjusts the
limits of running jobs of different priorities. Based on the system
situation, the node controller may reduce the limits of lower priority
jobs and increase the limits of higher priority jobs. However, we are
seeing the node controller get stuck for long periods of time reclaiming
from lower priority jobs while setting their limits, and it also spends
a lot of its own CPU doing so.
One of the workarounds we are trying is to fork a new process which sets
the limit of the lower priority job along with setting an alarm to get
itself killed if it gets stuck in the reclaim for the lower priority job.
However, we are finding it very unreliable and costly. Either we need a
good enough time buffer for the alarm to be delivered after setting the
limit, and potentially spend a lot of CPU in the reclaim, or we are
unreliable in setting the limit with much shorter but cheaper (less
reclaim) alarms.
Let's introduce a new limit setting option which does not trigger reclaim
and/or oom-kill and lets the processes in the target cgroup trigger
reclaim and/or throttling and/or oom-kill on their next charge request.
This will make the node controller in multi-tenant overcommitted
environments much more reliable.
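
For illustration only, here is a minimal userspace sketch of how a node controller might use the option described above. The helper name set_limit_nonblocking() and the example cgroup path are assumptions made for this sketch, not part of the patch; the only behavior taken from the patch is that the limit file is opened with O_NONBLOCK before the write. On a kernel without this change the flag would presumably be ignored and the write could still reclaim synchronously.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): write a new limit to a
 * cgroup v2 limit file such as memory.max or memory.high, opening it
 * with O_NONBLOCK so that, with this patch applied, the writer does not
 * perform synchronous reclaim or oom-kill on behalf of the target cgroup.
 *
 * Returns 0 on success, -1 on failure.
 */
int set_limit_nonblocking(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t written;
	int fd;

	fd = open(path, O_WRONLY | O_NONBLOCK);
	if (fd < 0)
		return -1;

	/* The limit is written as a plain string, e.g. "1073741824" or "max". */
	written = write(fd, value, len);
	close(fd);

	return written == (ssize_t)len ? 0 : -1;
}
```

A caller might invoke it as set_limit_nonblocking("/sys/fs/cgroup/job-low/memory.max", "1073741824"), where the cgroup name job-low is made up for the example.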
Explanation from Johannes on side-effects of O_NONBLOCK limit change:
It's usually the allocating tasks inside the group bearing the cost of
limit enforcement and reclaim. This allows a (privileged) updater from
outside the group to keep that cost in there - instead of having to
help, from a context that doesn't necessarily make sense.
I suppose the tradeoff with that - and the reason why this was doing
sync reclaim in the first place - is that, if the group is idle and
not trying to allocate more, it can take indefinitely for the new
limit to actually be met.
It should be okay in most scenarios in practice. As the capacity is
reallocated from group A to B, B will exert pressure on A once it
tries to claim it and thereby shrink it down. If A is idle, that
shouldn't be hard. If A is running, it's likely to fault/allocate
soon-ish and then join the effort.
It does leave a (malicious) corner case where A is just busy-hitting
its memory to interfere with the clawback. This is comparable to
reclaiming memory.low overage from the outside, though, which is an
acceptable risk. Users of O_NONBLOCK just need to be aware.
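
Since the new limit may take an unbounded time to be met, a caller that cares about convergence may want to check on it afterwards. The following is only a sketch under assumptions: it reads the cgroup v2 memory.current file and compares it with the limit that was written, and the helper names, the retry count, and the polling interval are all invented for the example. What to do when usage stays above the limit is left to the caller.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read current usage in bytes from a memory.current
 * style file. Returns 1 if usage <= limit_bytes, 0 if it is still above
 * the limit, and -1 if the file cannot be read or parsed.
 */
int usage_within_limit(const char *current_path, unsigned long long limit_bytes)
{
	unsigned long long usage;
	FILE *f = fopen(current_path, "r");
	int matched;

	if (!f)
		return -1;
	matched = fscanf(f, "%llu", &usage);
	fclose(f);
	if (matched != 1)
		return -1;

	return usage <= limit_bytes ? 1 : 0;
}

/*
 * Poll a bounded number of times instead of blocking in the kernel.
 * Returns 1 once usage is within the limit, 0 if it is still above the
 * limit after all attempts, and -1 on a read error.
 */
int wait_for_limit(const char *current_path, unsigned long long limit_bytes,
		   int attempts, unsigned int interval_sec)
{
	for (int i = 0; i < attempts; i++) {
		int ret = usage_within_limit(current_path, limit_bytes);

		if (ret != 0)
			return ret;
		if (i + 1 < attempts)
			sleep(interval_sec);
	}
	return 0;
}
```

The point of the bounded loop is that the controller stays in charge of how long it is willing to wait, rather than being held inside the limit write.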
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Changes since v2:
- Added more explanation in doc and commit message on O_NONBLOCK
side-effects (Johannes)
Changes since v1:
- Instead of new interfaces use O_NONBLOCK flag (Greg, Roman & Tejun)
Documentation/admin-guide/cgroup-v2.rst | 24 ++++++++++++++++++++++++
mm/memcontrol.c | 10 ++++++++--
2 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a497db5e6496..b3e0fdc00614 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1299,6 +1299,18 @@ PAGE_SIZE multiple when read back.
monitors the limited cgroup to alleviate heavy reclaim
pressure.
+ If memory.high is opened with O_NONBLOCK then the synchronous
+ reclaim is bypassed. This is useful for admin processes that
+ need to dynamically adjust the job's memory limits without
+ expending their own CPU resources on memory reclamation. The
+ job will trigger the reclaim and/or get throttled on its
+ next charge request.
+
+ Please note that with O_NONBLOCK, there is a chance that the
+ target memory cgroup may take indefinite amount of time to
+ reduce usage below the limit due to delayed charge request or
+ busy-hitting its memory to slow down reclaim.
+
memory.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
@@ -1316,6 +1328,18 @@ PAGE_SIZE multiple when read back.
Caller could retry them differently, return into userspace
as -ENOMEM or silently ignore in cases like disk readahead.
+ If memory.max is opened with O_NONBLOCK, then the synchronous
+ reclaim and oom-kill are bypassed. This is useful for admin
+ processes that need to dynamically adjust the job's memory limits
+ without expending their own CPU resources on memory reclamation.
+ The job will trigger the reclaim and/or oom-kill on its next
+ charge request.
+
+ Please note that with O_NONBLOCK, there is a chance that the
+ target memory cgroup may take indefinite amount of time to
+ reduce usage below the limit due to delayed charge request or
+ busy-hitting its memory to slow down reclaim.
+
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29860f067952..f8b9c7aa6771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4328,6 +4328,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
page_counter_set_high(&memcg->memory, high);
+ if (of->file->f_flags & O_NONBLOCK)
+ goto out;
+
for (;;) {
unsigned long nr_pages = page_counter_read(&memcg->memory);
unsigned long reclaimed;
@@ -4350,7 +4353,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
if (!reclaimed && !nr_retries--)
break;
}
-
+out:
memcg_wb_domain_size_changed(memcg);
return nbytes;
}
@@ -4377,6 +4380,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
xchg(&memcg->memory.max, max);
+ if (of->file->f_flags & O_NONBLOCK)
+ goto out;
+
for (;;) {
unsigned long nr_pages = page_counter_read(&memcg->memory);
@@ -4404,7 +4410,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
break;
cond_resched();
}
-
+out:
memcg_wb_domain_size_changed(memcg);
return nbytes;
}
--
2.47.1
* Re: [PATCH v3] memcg: introduce non-blocking limit setting option
2025-05-06 23:28 [PATCH v3] memcg: introduce non-blocking limit setting option Shakeel Butt
@ 2025-05-07 0:53 ` Andrew Morton
2025-05-07 7:30 ` Michal Hocko
From: Andrew Morton @ 2025-05-07 0:53 UTC (permalink / raw)
To: Shakeel Butt
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
linux-mm, cgroups, linux-kernel, Meta kernel team, Greg Thelen,
Michal Koutný,
Tejun Heo, Yosry Ahmed, Christian Brauner
Thanks, I queued this as a -fix:
--- a/Documentation/admin-guide/cgroup-v2.rst~memcg-introduce-non-blocking-limit-setting-option-v3
+++ a/Documentation/admin-guide/cgroup-v2.rst
@@ -1299,12 +1299,17 @@ PAGE_SIZE multiple when read back.
monitors the limited cgroup to alleviate heavy reclaim
pressure.
- If memory.high is opened with O_NONBLOCK then the synchronous
- reclaim is bypassed. This is useful for admin processes that
- need to dynamically adjust the job's memory limits without
- expending their own CPU resources on memory reclamation. The
- job will trigger the reclaim and/or get throttled on its
- next charge request.
+ If memory.high is opened with O_NONBLOCK then the synchronous
+ reclaim is bypassed. This is useful for admin processes that
+ need to dynamically adjust the job's memory limits without
+ expending their own CPU resources on memory reclamation. The
+ job will trigger the reclaim and/or get throttled on its
+ next charge request.
+
+ Please note that with O_NONBLOCK, there is a chance that the
+ target memory cgroup may take indefinite amount of time to
+ reduce usage below the limit due to delayed charge request or
+ busy-hitting its memory to slow down reclaim.
memory.max
A read-write single value file which exists on non-root
@@ -1323,12 +1328,17 @@ PAGE_SIZE multiple when read back.
Caller could retry them differently, return into userspace
as -ENOMEM or silently ignore in cases like disk readahead.
- If memory.max is opened with O_NONBLOCK, then the synchronous
- reclaim and oom-kill are bypassed. This is useful for admin
- processes that need to dynamically adjust the job's memory limits
- without expending their own CPU resources on memory reclamation.
- The job will trigger the reclaim and/or oom-kill on its next
- charge request.
+ If memory.max is opened with O_NONBLOCK, then the synchronous
+ reclaim and oom-kill are bypassed. This is useful for admin
+ processes that need to dynamically adjust the job's memory limits
+ without expending their own CPU resources on memory reclamation.
+ The job will trigger the reclaim and/or oom-kill on its next
+ charge request.
+
+ Please note that with O_NONBLOCK, there is a chance that the
+ target memory cgroup may take indefinite amount of time to
+ reduce usage below the limit due to delayed charge request or
+ busy-hitting its memory to slow down reclaim.
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
_
* Re: [PATCH v3] memcg: introduce non-blocking limit setting option
2025-05-06 23:28 [PATCH v3] memcg: introduce non-blocking limit setting option Shakeel Butt
2025-05-07 0:53 ` Andrew Morton
@ 2025-05-07 7:30 ` Michal Hocko
From: Michal Hocko @ 2025-05-07 7:30 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Muchun Song,
linux-mm, cgroups, linux-kernel, Meta kernel team, Greg Thelen,
Michal Koutný,
Tejun Heo, Yosry Ahmed, Christian Brauner
On Tue 06-05-25 16:28:33, Shakeel Butt wrote:
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts the
> limits of running jobs of different priorities. Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs. However, we are
> seeing the node controller get stuck for long periods of time reclaiming
> from lower priority jobs while setting their limits, and it also spends
> a lot of its own CPU doing so.
>
> One of the workarounds we are trying is to fork a new process which sets
> the limit of the lower priority job along with setting an alarm to get
> itself killed if it gets stuck in the reclaim for the lower priority job.
> However, we are finding it very unreliable and costly. Either we need a
> good enough time buffer for the alarm to be delivered after setting the
> limit, and potentially spend a lot of CPU in the reclaim, or we are
> unreliable in setting the limit with much shorter but cheaper (less
> reclaim) alarms.
>
> Let's introduce a new limit setting option which does not trigger reclaim
> and/or oom-kill and lets the processes in the target cgroup trigger
> reclaim and/or throttling and/or oom-kill on their next charge request.
> This will make the node controller in multi-tenant overcommitted
> environments much more reliable.
I would say this is a somewhat creative way to go about kernel interfaces.
I am not aware of any other precedent like that, but I recognize this is
likely better than a new set of non-blocking interfaces.
It is a bit unfortunate that we haven't explicitly excluded O_NONBLOCK
previously, so we cannot really add this functionality without risking
breaking existing users. Sure, it hasn't made sense to write to these
files with O_NONBLOCK until now, so there is hope that nobody does.
> Explanation from Johannes on side-effects of O_NONBLOCK limit change:
> It's usually the allocating tasks inside the group bearing the cost of
> limit enforcement and reclaim. This allows a (privileged) updater from
> outside the group to keep that cost in there - instead of having to
> help, from a context that doesn't necessarily make sense.
>
> I suppose the tradeoff with that - and the reason why this was doing
> sync reclaim in the first place - is that, if the group is idle and
> not trying to allocate more, it can take indefinitely for the new
> limit to actually be met.
>
> It should be okay in most scenarios in practice. As the capacity is
> reallocated from group A to B, B will exert pressure on A once it
> tries to claim it and thereby shrink it down. If A is idle, that
> shouldn't be hard. If A is running, it's likely to fault/allocate
> soon-ish and then join the effort.
>
> It does leave a (malicious) corner case where A is just busy-hitting
> its memory to interfere with the clawback. This is comparable to
> reclaiming memory.low overage from the outside, though, which is an
> acceptable risk. Users of O_NONBLOCK just need to be aware.
Good and useful clarification. Thx!
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Greg Thelen <gthelen@google.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
--
Michal Hocko
SUSE Labs