* [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators
@ 2024-06-06 15:22 Michal Koutný
2024-06-06 15:22 ` [RFC PATCH v5 1/3] memcg: Add memory.max.effective attribute Michal Koutný
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Michal Koutný @ 2024-06-06 15:22 UTC (permalink / raw)
To: cgroups, linux-doc, linux-kernel, linux-mm
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Jan Kratochvil (Azul)
Some applications use memory cgroup limits to scale their own memory
needs. Reading of the immediate membership cgroup's memory.max is not
sufficient because of possible ancestral limits. The application could
traverse upwards to figure out the tightest limit but this would not
work in cgroup namespace where the view of cgroup hierarchy is
incomplete and the limit may apply from outer world.
Additionally, applications should respond to limit changes.
(cgroup v1 used memory.stat:hierarchical_memory_limit to report the
value but there's no such counterpart in cgroup v2 memory.stat.)
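For illustration only (not part of the series), here is a minimal userspace sketch in C of the upward traversal mentioned above. It assumes the application has already read memory.max at every level it can see, own cgroup first. parse_memory_max() and tightest_limit() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* memory.max holds either "max" (no limit) or a byte count. */
static uint64_t parse_memory_max(const char *text)
{
	if (strncmp(text, "max", 3) == 0)
		return UINT64_MAX;
	return strtoull(text, NULL, 10);
}

/*
 * Tightest limit over the cgroup itself and the ancestors whose
 * memory.max contents were collected in values[0..n-1].
 */
static uint64_t tightest_limit(const char *const *values, size_t n)
{
	uint64_t limit = UINT64_MAX;

	for (size_t i = 0; i < n; i++) {
		uint64_t v = parse_memory_max(values[i]);

		if (v < limit)
			limit = v;
	}
	return limit;
}
```

Inside a cgroup namespace the array ends at the namespace root, so a tighter limit set further out never enters the minimum and the result can be too optimistic.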
Introduce a new memcg attribute file that contains the effective value
of memory limit for given cgroup (following cpuset.cpus.effective
pattern) and that sends notifications like memory.events when the
effective limit changes.
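A hedged sketch of how an application might consume such a notification, assuming the file behaves like memory.events, i.e. a modification shows up as POLLPRI (possibly with POLLERR) on an open file descriptor. read_limit() and wait_for_change() are hypothetical helper names, and opening memory.max.effective is left to the caller.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Re-read the file from the start; returns bytes read or -1 on error. */
static ssize_t read_limit(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0 || lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}

/*
 * Wait up to timeout_ms for a file modified event.
 * Returns 1 when notified, 0 on timeout, -1 on poll() failure.
 */
static int wait_for_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return (pfd.revents & (POLLPRI | POLLERR)) ? 1 : 0;
}
```

An allocator would typically call read_limit() once at start-up and then alternate wait_for_change() and read_limit() in a dedicated thread, resizing its heap whenever the value differs from the last one seen.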
Reasons for RFC:
1) Should global limit be included? (And respond to memory hotplug?)
2) Is swap.max.effective needed? (in v2 without memsw accounting)
3) Should memory.high be also handled?
4) What would be an alternative?
My answers to RFC:
1) No (there's no memory.max in global root memcg)
2) No (app doesn't have full control of memory that's swapped out)
3) No (scaling the allocator against the "soft" limit could end up in
dynamics that are difficult to reason about and administer)
4)
- PSI (too obscure for traditional users but better semantics for limit
shrinking)
- memory.stat field (like v1 but separate attribute is better for
notifications, cpuset precedent)
Changes from v4 (https://lore.kernel.org/r/ZcvlhOZ4VBEX9raZ@host1.jankratochvil.net)
- split the patch for swap.max.effective
- add Documentation/
- reword commit messages
- add notification support
Michal Koutný (3):
memcg: Add memory.max.effective attribute
memcg: Add memory.swap.max.effective like hierarchical_memsw_limit
memcg: Notify on memory.max.effective changes
Documentation/admin-guide/cgroup-v2.rst | 6 ++++
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 46 +++++++++++++++++++++++++
3 files changed, 54 insertions(+)
base-commit: 2df0193e62cf887f373995fb8a91068562784adc
--
2.45.1
* [RFC PATCH v5 1/3] memcg: Add memory.max.effective attribute
2024-06-06 15:22 [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Michal Koutný
@ 2024-06-06 15:22 ` Michal Koutný
2024-06-06 15:22 ` [RFC PATCH v5 2/3] memcg: Add memory.swap.max.effective like hierarchical_memsw_limit Michal Koutný
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Michal Koutný @ 2024-06-06 15:22 UTC (permalink / raw)
To: cgroups, linux-doc, linux-kernel, linux-mm
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Jan Kratochvil (Azul)
Some applications use memory cgroup limits to scale their own memory
needs. Reading of the immediate membership cgroup's memory.max is not
sufficient because of possible ancestral limits. The application could
traverse upwards to figure out the tightest limit but this would not
work in cgroup namespace where the view of cgroup hierarchy is
incomplete and the limit may apply from outer world.
(cgroup v1 used memory.stat:hierarchical_memory_limit to report the
value but there's no such counterpart in cgroup v2 memory.stat.)
Introduce a new memcg attribute file that contains the effective value
of memory limit for given cgroup (following cpuset.cpus.effective
pattern).
Signed-off-by: Jan Kratochvil (Azul) <jkratochvil@azul.com>
[ mkoutny: rewrite commit message, split out memory.swap.max]
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
mm/memcontrol.c | 18 ++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8fbb0519d556..988f26264054 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1293,6 +1293,12 @@ PAGE_SIZE multiple when read back.
Caller could retry them differently, return into userspace
as -ENOMEM or silently ignore in cases like disk readahead.
+ memory.max.effective
+ A read-only file that provides effective value of cgroup's hard usage
+ limit. It incorporates limits of all ancestors, even those not visible
+ in cgroupns. The value change in this file generates a file modified
+ event.
+
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fad15b2290c..86bcec84fe7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7065,6 +7065,19 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_max_effective_show(struct seq_file *m, void *v)
+{
+ unsigned long memory;
+ struct mem_cgroup *mi;
+
+ /* Hierarchical information */
+ memory = PAGE_COUNTER_MAX;
+ for (mi = mem_cgroup_from_seq(m); mi; mi = parent_mem_cgroup(mi))
+ memory = min(memory, READ_ONCE(mi->memory.max));
+
+ return seq_puts_memcg_tunable(m, memory);
+}
+
/*
* Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
* if any new events become available.
@@ -7259,6 +7272,11 @@ static struct cftype memory_files[] = {
.seq_show = memory_max_show,
.write = memory_max_write,
},
+ {
+ .name = "max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_max_effective_show,
+ },
{
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
--
2.45.1
* [RFC PATCH v5 2/3] memcg: Add memory.swap.max.effective like hierarchical_memsw_limit
2024-06-06 15:22 [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Michal Koutný
2024-06-06 15:22 ` [RFC PATCH v5 1/3] memcg: Add memory.max.effective attribute Michal Koutný
@ 2024-06-06 15:22 ` Michal Koutný
2024-06-06 15:22 ` [RFC PATCH v5 3/3] memcg: Notify on memory.max.effective changes Michal Koutný
2024-06-06 18:15 ` [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Roman Gushchin
3 siblings, 0 replies; 7+ messages in thread
From: Michal Koutný @ 2024-06-06 15:22 UTC (permalink / raw)
To: cgroups, linux-doc, linux-kernel, linux-mm
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Jan Kratochvil (Azul)
cgroup v1 used memory.stat:hierarchical_memsw_limit to report the value
of the effective memsw limit. cgroup v2 has no combined charging, only
the swap.max limit, so add a new memcg attribute file that contains the
effective value of the swap limit for the given cgroup (following the
cpuset.cpus.effective pattern) for cases when the whole hierarchy cannot
be traversed upwards due to cgroupns visibility.
Signed-off-by: Jan Kratochvil (Azul) <jkratochvil@azul.com>
[ mkoutny: rewrite commit message, only memory.swap.max change]
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
mm/memcontrol.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86bcec84fe7b..a889385f6033 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8279,6 +8279,19 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int swap_max_effective_show(struct seq_file *m, void *v)
+{
+ unsigned long swap;
+ struct mem_cgroup *mi;
+
+ /* Hierarchical information */
+ swap = PAGE_COUNTER_MAX;
+ for (mi = mem_cgroup_from_seq(m); mi; mi = parent_mem_cgroup(mi))
+ swap = min(swap, READ_ONCE(mi->swap.max));
+
+ return seq_puts_memcg_tunable(m, swap);
+}
+
static int swap_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
@@ -8311,6 +8324,11 @@ static struct cftype swap_files[] = {
.seq_show = swap_max_show,
.write = swap_max_write,
},
+ {
+ .name = "swap.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = swap_max_effective_show,
+ },
{
.name = "swap.peak",
.flags = CFTYPE_NOT_ON_ROOT,
--
2.45.1
* [RFC PATCH v5 3/3] memcg: Notify on memory.max.effective changes
2024-06-06 15:22 [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Michal Koutný
2024-06-06 15:22 ` [RFC PATCH v5 1/3] memcg: Add memory.max.effective attribute Michal Koutný
2024-06-06 15:22 ` [RFC PATCH v5 2/3] memcg: Add memory.swap.max.effective like hierarchical_memsw_limit Michal Koutný
@ 2024-06-06 15:22 ` Michal Koutný
2024-06-06 18:15 ` [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Roman Gushchin
3 siblings, 0 replies; 7+ messages in thread
From: Michal Koutný @ 2024-06-06 15:22 UTC (permalink / raw)
To: cgroups, linux-doc, linux-kernel, linux-mm
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Jan Kratochvil (Azul)
When users are interested in a cgroup's effective limit, they typically
act on the value, and therefore they should be notified when the value
changes. Use the standard mechanism of triggering a modification event
on the respective cgroup file.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 10 ++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 030d34e9d117..79ecbbd87c4c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -232,6 +232,8 @@ struct mem_cgroup {
/* memory.events and memory.events.local */
struct cgroup_file events_file;
struct cgroup_file events_local_file;
+ /* memory.max.effective */
+ struct cgroup_file mem_max_file;
/* handle for "memory.swap.events" */
struct cgroup_file swap_events_file;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a889385f6033..72c8e4693506 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7022,6 +7022,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ struct mem_cgroup *iter;
unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
bool drained = false;
unsigned long max;
@@ -7061,6 +7062,14 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
break;
}
+ /*
+ * Notification about limit tightening, not about coming OOMs, so it
+ * can be after reclaim.
+ */
+ for_each_mem_cgroup_tree(iter, memcg) {
+ cgroup_file_notify(&iter->mem_max_file);
+ }
+
memcg_wb_domain_size_changed(memcg);
return nbytes;
}
@@ -7275,6 +7284,7 @@ static struct cftype memory_files[] = {
{
.name = "max.effective",
.flags = CFTYPE_NOT_ON_ROOT,
+ .file_offset = offsetof(struct mem_cgroup, mem_max_file),
.seq_show = memory_max_effective_show,
},
{
--
2.45.1
* Re: [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators
2024-06-06 15:22 [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Michal Koutný
` (2 preceding siblings ...)
2024-06-06 15:22 ` [RFC PATCH v5 3/3] memcg: Notify on memory.max.effective changes Michal Koutný
@ 2024-06-06 18:15 ` Roman Gushchin
2024-08-17 6:00 ` Jan Kratochvil
3 siblings, 1 reply; 7+ messages in thread
From: Roman Gushchin @ 2024-06-06 18:15 UTC (permalink / raw)
To: Michal Koutný
Cc: cgroups, linux-doc, linux-kernel, linux-mm, Tejun Heo, Zefan Li,
Johannes Weiner, Jonathan Corbet, Michal Hocko, Shakeel Butt,
Muchun Song, Andrew Morton, Jan Kratochvil (Azul)
On Thu, Jun 06, 2024 at 05:22:29PM +0200, Michal Koutný wrote:
> Some applications use memory cgroup limits to scale their own memory
> needs. Reading of the immediate membership cgroup's memory.max is not
> sufficient because of possible ancestral limits. The application could
> traverse upwards to figure out the tightest limit but this would not
> work in cgroup namespace where the view of cgroup hierarchy is
> incomplete and the limit may apply from outer world.
> Additionally, applications should respond to limit changes.
If the goal is to detect how much memory would it be possible to allocate,
I'm not sure that knowing all memory.max limits upper in the hierarchy
really buys anything without knowing actual usages and a potential
for memory reclaim across the entire tree.
E.g.:
A (max = 100G)
| \
B C
C's effective max will come out as 100G, but if B.anon_usage = 100G and
there is no swap, the actual number is 0.
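The arithmetic of this example can be sketched as follows. This is an illustration under stated assumptions, not anything from the series: struct level, effective_max() and headroom() are hypothetical names, usage is taken to be hierarchical (A's usage includes B's), and nothing is considered reclaimable.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_LIMIT UINT64_MAX

/* One cgroup on the path from the workload up to the root. */
struct level {
	uint64_t max;   /* memory.max in bytes, NO_LIMIT for "max" */
	uint64_t usage; /* hierarchical usage in bytes */
};

/* What the proposed file would report: minimum of memory.max on the path. */
static uint64_t effective_max(const struct level *path, size_t n)
{
	uint64_t limit = NO_LIMIT;

	for (size_t i = 0; i < n; i++)
		if (path[i].max < limit)
			limit = path[i].max;
	return limit;
}

/* What can still be charged: minimum of (max - usage) on the path. */
static uint64_t headroom(const struct level *path, size_t n)
{
	uint64_t room = NO_LIMIT;

	for (size_t i = 0; i < n; i++) {
		uint64_t left;

		if (path[i].max == NO_LIMIT)
			continue;
		left = path[i].usage < path[i].max ?
		       path[i].max - path[i].usage : 0;
		if (left < room)
			room = left;
	}
	return room;
}
```

For C's path { C (no limit, usage 0), A (max 100G, usage 100G because of B) } effective_max() yields 100G while headroom() yields 0.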
But if it's more about exploring the "invisible" part of the cgroup
tree configuration, it makes sense to me.
Not sure about the naming, maybe something like memory.tree.max
or memory.parent.max or even memory.hierarchical.max is a better fit.
Thanks!
* Re: [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators
2024-06-06 18:15 ` [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators Roman Gushchin
@ 2024-08-17 6:00 ` Jan Kratochvil
2024-08-19 16:42 ` Michal Koutný
0 siblings, 1 reply; 7+ messages in thread
From: Jan Kratochvil @ 2024-08-17 6:00 UTC (permalink / raw)
To: Roman Gushchin
Cc: Michal Koutný,
cgroups, linux-doc, linux-kernel, linux-mm, Tejun Heo, Zefan Li,
Johannes Weiner, Jonathan Corbet, Michal Hocko, Shakeel Butt,
Muchun Song, Andrew Morton
On Fri, 07 Jun 2024 02:15:00 +0800, Roman Gushchin wrote:
> If the goal is to detect how much memory would it be possible to allocate,
> I'm not sure that knowing all memory.max limits upper in the hierarchy
> really buys anything without knowing actual usages and a potential
> for memory reclaim across the entire tree.
>
> E.g.:
>
> A (max = 100G)
> | \
> B C
>
> C's effective max will come out as 100G, but if B.anon_usage = 100G and
> there is no swap, the actual number is 0.
Yes, it would be better to subtract the used memory from ancestor (and thus
even current) cgroups. The original use case of this feature is for cloud
nodes running a single Java JVM where the sibling cgroups are not an issue.
Jan Kratochvil
* Re: [RFC PATCH v5 0/3] Add memory.max.effective for application's allocators
2024-08-17 6:00 ` Jan Kratochvil
@ 2024-08-19 16:42 ` Michal Koutný
0 siblings, 0 replies; 7+ messages in thread
From: Michal Koutný @ 2024-08-19 16:42 UTC (permalink / raw)
To: Jan Kratochvil
Cc: Roman Gushchin, cgroups, linux-doc, linux-kernel, linux-mm,
Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton
Hello.
On Sat, Aug 17, 2024 at 02:00:15PM GMT, Jan Kratochvil <jkratochvil@azul.com> wrote:
> Yes, it would be better to subtract the used memory from ancestor (and thus
> even current) cgroups.
Then it becomes a more dynamic characteristic and it leads to
calculations of available memory. I share a link [1] for completeness
and to prevent repeated discussions (the past one ended up with no
memory.stat:avail).
> The original use case of this feature is for cloud nodes running a
> single Java JVM where the sibling cgroups are not an issue.
IIUC, it's a tree like this:
O
/ | \
A B C // B:memory.max < O:memory.max
|
...
|
W // workload
This picture made me realize that the memory controller may not even be
enabled all the way down from B to W, i.e. W would have no
memory.max.effective. IOW, a memory.* attribute would not be the right
place for such a value. That would even apply in the apparently
purposeful case where there is a cgroup NS boundary between B and W.
(At least in the proposed implementation, the memory.* file would have to
be decoupled from the memory controller, similarly to e.g.
cpu.stat:usage_usec.)
Jan, do I get the tree shape right? Are B and W in different cgroup
namespaces?
Thanks,
Michal
[1] https://lore.kernel.org/all/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/