From: Michal Hocko <mhocko@suse.com>
To: David Finkel <davidf@vimeo.com>
Cc: Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
core-services@vimeo.com, Jonathan Corbet <corbet@lwn.net>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shuah Khan <shuah@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
Zefan Li <lizefan.x@bytedance.com>,
cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Shakeel Butt <shakeel.butt@linux.dev>
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
Date: Tue, 16 Jul 2024 09:20:08 +0200
Message-ID: <ZpYfKI6W1uSMkt5i@tiehlicka>
In-Reply-To: <CAFUnj5MahNvM+B2zynVtcnYKJ7LZHwBNEcPKGAdz-tesDeOXcw@mail.gmail.com>
On Mon 15-07-24 16:46:36, David Finkel wrote:
> > On Mon, Jul 15, 2024 at 4:38 PM David Finkel <davidf@vimeo.com> wrote:
> > >
> > > Other mechanisms for querying the peak memory usage of either a process
> > > or v1 memory cgroup allow for resetting the high watermark. Restore
> > > parity with those mechanisms.
> > >
> > > For example:
> > > - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
> > > the high watermark.
> > > - Writing "5" to the clear_refs pseudo-file in a process's proc
> > > directory resets the peak RSS.
> > >
> > > This change copies the cgroup v1 behavior so any write to the
> > > memory.peak and memory.swap.peak pseudo-files resets the high watermark
> > > to the current usage.
> > >
> > > This behavior is particularly useful for work scheduling systems that
> > > need to track memory usage of worker processes/cgroups per-work-item.
> > > Since memory can't be squeezed like CPU can (the OOM-killer has
> > > opinions),

I do not understand the OOM-killer reference here. Why does it matter?
Could you explain, please?

> > > these systems need to track the peak memory usage to compute
> > > system/container fullness when binpacking workitems.

Could you elaborate some more on how you are using this, please? I expect
you recycle memcgs for different runs of workers, reset the peak
consumption before a new run, and record it after the run is done. What
is not really clear to me is how the peak value helps if it can vary a
lot among different runs. But maybe I misunderstand.
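
To make the workflow I am assuming concrete, here is a minimal userspace
sketch: reset the watermark, run one work item, then read the peak back.
All function names (peak_reset, peak_read, run_and_record_peak) and the
buffer size are hypothetical and only for illustration. The reset only
works on a kernel with the proposed write handlers applied; on current
kernels memory.peak is read-only and the write would fail.

```c
#include <assert.h>
#include <stdio.h>

#define PEAK_PATH_LEN 4096 /* arbitrary buffer size chosen for this sketch */

/* Build "<cgroup_dir>/<file>"; returns 0 on success, -1 if it does not fit. */
static int peak_path(char *buf, size_t len, const char *cgroup_dir,
		     const char *file)
{
	int n = snprintf(buf, len, "%s/%s", cgroup_dir, file);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Reset the watermark in e.g. memory.peak or memory.swap.peak.
 * With the proposed patch any non-empty write resets the peak to the
 * current usage and the written content is ignored, so "\n" is enough.
 */
int peak_reset(const char *cgroup_dir, const char *file)
{
	char path[PEAK_PATH_LEN];
	FILE *f;
	int err;

	assert(cgroup_dir && file);
	if (peak_path(path, sizeof(path), cgroup_dir, file))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	err = (fputs("\n", f) == EOF);
	/* The write reaches the kernel when the stream is flushed on close. */
	if (fclose(f) == EOF)
		err = 1;
	return err ? -1 : 0;
}

/* Read the recorded peak in bytes; returns -1 on any error. */
long long peak_read(const char *cgroup_dir, const char *file)
{
	char path[PEAK_PATH_LEN];
	long long val;
	FILE *f;

	assert(cgroup_dir && file);
	if (peak_path(path, sizeof(path), cgroup_dir, file))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/*
 * Hypothetical per-work-item wrapper: reset, run, record.
 * work() is expected to return 0 on success.
 */
long long run_and_record_peak(const char *cgroup_dir,
			      int (*work)(void *), void *arg)
{
	if (peak_reset(cgroup_dir, "memory.peak"))
		return -1;
	if (work(arg))
		return -1;
	return peak_read(cgroup_dir, "memory.peak");
}
```

A scheduler would call run_and_record_peak() once per work item on a
recycled memcg directory. Note that the value read back is a single
sample for that run, which is exactly why the run-to-run variance
question above matters.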
> > >
> > > Signed-off-by: David Finkel <davidf@vimeo.com>
> > > ---
> > > Documentation/admin-guide/cgroup-v2.rst | 20 +++---
> > > mm/memcontrol.c | 23 ++++++
> > > .../selftests/cgroup/test_memcontrol.c | 72 ++++++++++++++++---
> > > 3 files changed, 99 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 8fbb0519d556..201d8e5d9f82 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1322,11 +1322,13 @@ PAGE_SIZE multiple when read back.
> > > reclaim induced by memory.reclaim.
> > >
> > > memory.peak
> > > - A read-only single value file which exists on non-root
> > > - cgroups.
> > > + A read-write single value file which exists on non-root cgroups.
> > > +
> > > + The max memory usage recorded for the cgroup and its descendants since
> > > + either the creation of the cgroup or the most recent reset.
> > >
> > > - The max memory usage recorded for the cgroup and its
> > > - descendants since the creation of the cgroup.
> > > + Any non-empty write to this file resets it to the current memory usage.
> > > + All content written is completely ignored.
> > >
> > > memory.oom.group
> > > A read-write single value file which exists on non-root
> > > @@ -1652,11 +1654,13 @@ PAGE_SIZE multiple when read back.
> > > Healthy workloads are not expected to reach this limit.
> > >
> > > memory.swap.peak
> > > - A read-only single value file which exists on non-root
> > > - cgroups.
> > > + A read-write single value file which exists on non-root cgroups.
> > > +
> > > + The max swap usage recorded for the cgroup and its descendants since
> > > + the creation of the cgroup or the most recent reset.
> > >
> > > - The max swap usage recorded for the cgroup and its
> > > - descendants since the creation of the cgroup.
> > > + Any non-empty write to this file resets it to the current swap usage.
> > > + All content written is completely ignored.
> > >
> > > memory.swap.max
> > > A read-write single value file which exists on non-root
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 8f2f1bb18c9c..abfa547615d6 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -25,6 +25,7 @@
> > > * Copyright (C) 2020 Alibaba, Inc, Alex Shi
> > > */
> > >
> > > +#include <linux/cgroup-defs.h>
> > > #include <linux/page_counter.h>
> > > #include <linux/memcontrol.h>
> > > #include <linux/cgroup.h>
> > > @@ -6915,6 +6916,16 @@ static u64 memory_peak_read(struct cgroup_subsys_state *css,
> > > return (u64)memcg->memory.watermark * PAGE_SIZE;
> > > }
> > >
> > > +static ssize_t memory_peak_write(struct kernfs_open_file *of,
> > > +				 char *buf, size_t nbytes, loff_t off)
> > > +{
> > > +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > +
> > > +	page_counter_reset_watermark(&memcg->memory);
> > > +
> > > +	return nbytes;
> > > +}
> > > +
> > > static int memory_min_show(struct seq_file *m, void *v)
> > > {
> > > return seq_puts_memcg_tunable(m,
> > > @@ -7232,6 +7243,7 @@ static struct cftype memory_files[] = {
> > > .name = "peak",
> > > .flags = CFTYPE_NOT_ON_ROOT,
> > > .read_u64 = memory_peak_read,
> > > + .write = memory_peak_write,
> > > },
> > > {
> > > .name = "min",
> > > @@ -8201,6 +8213,16 @@ static u64 swap_peak_read(struct cgroup_subsys_state *css,
> > > return (u64)memcg->swap.watermark * PAGE_SIZE;
> > > }
> > >
> > > +static ssize_t swap_peak_write(struct kernfs_open_file *of,
> > > +			       char *buf, size_t nbytes, loff_t off)
> > > +{
> > > +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > +
> > > +	page_counter_reset_watermark(&memcg->swap);
> > > +
> > > +	return nbytes;
> > > +}
> > > +
> > > static int swap_high_show(struct seq_file *m, void *v)
> > > {
> > > return seq_puts_memcg_tunable(m,
> > > @@ -8283,6 +8305,7 @@ static struct cftype swap_files[] = {
> > > .name = "swap.peak",
> > > .flags = CFTYPE_NOT_ON_ROOT,
> > > .read_u64 = swap_peak_read,
> > > + .write = swap_peak_write,
> > > },
> > > {
> > > .name = "swap.events",
> > > diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> > > index 41ae8047b889..681972de673b 100644
> > > --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> > > +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> > > @@ -161,12 +161,12 @@ static int alloc_pagecache_50M_check(const char *cgroup, void *arg)
> > > /*
> > > * This test create a memory cgroup, allocates
> > > * some anonymous memory and some pagecache
> > > - * and check memory.current and some memory.stat values.
> > > + * and checks memory.current, memory.peak, and some memory.stat values.
> > > */
> > > -static int test_memcg_current(const char *root)
> > > +static int test_memcg_current_peak(const char *root)
> > > {
> > > int ret = KSFT_FAIL;
> > > - long current;
> > > + long current, peak, peak_reset;
> > > char *memcg;
> > >
> > > memcg = cg_name(root, "memcg_test");
> > > @@ -180,12 +180,32 @@ static int test_memcg_current(const char *root)
> > > if (current != 0)
> > > goto cleanup;
> > >
> > > + peak = cg_read_long(memcg, "memory.peak");
> > > + if (peak != 0)
> > > + goto cleanup;
> > > +
> > > if (cg_run(memcg, alloc_anon_50M_check, NULL))
> > > goto cleanup;
> > >
> > > + peak = cg_read_long(memcg, "memory.peak");
> > > + if (peak < MB(50))
> > > + goto cleanup;
> > > +
> > > + peak_reset = cg_write(memcg, "memory.peak", "\n");
> > > + if (peak_reset != 0)
> > > + goto cleanup;
> > > +
> > > + peak = cg_read_long(memcg, "memory.peak");
> > > + if (peak > MB(30))
> > > + goto cleanup;
> > > +
> > > if (cg_run(memcg, alloc_pagecache_50M_check, NULL))
> > > goto cleanup;
> > >
> > > + peak = cg_read_long(memcg, "memory.peak");
> > > + if (peak < MB(50))
> > > + goto cleanup;
> > > +
> > > ret = KSFT_PASS;
> > >
> > > cleanup:
> > > @@ -817,13 +837,14 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
> > >
> > > /*
> > > * This test checks that memory.swap.max limits the amount of
> > > - * anonymous memory which can be swapped out.
> > > + * anonymous memory which can be swapped out. Additionally, it verifies that
> > > + * memory.swap.peak reflects the high watermark and can be reset.
> > > */
> > > -static int test_memcg_swap_max(const char *root)
> > > +static int test_memcg_swap_max_peak(const char *root)
> > > {
> > > int ret = KSFT_FAIL;
> > > char *memcg;
> > > - long max;
> > > + long max, peak;
> > >
> > > if (!is_swap_enabled())
> > > return KSFT_SKIP;
> > > @@ -840,6 +861,12 @@ static int test_memcg_swap_max(const char *root)
> > > goto cleanup;
> > > }
> > >
> > > + if (cg_read_long(memcg, "memory.swap.peak"))
> > > + goto cleanup;
> > > +
> > > + if (cg_read_long(memcg, "memory.peak"))
> > > + goto cleanup;
> > > +
> > > if (cg_read_strcmp(memcg, "memory.max", "max\n"))
> > > goto cleanup;
> > >
> > > @@ -862,6 +889,27 @@ static int test_memcg_swap_max(const char *root)
> > > if (cg_read_key_long(memcg, "memory.events", "oom_kill ") != 1)
> > > goto cleanup;
> > >
> > > + peak = cg_read_long(memcg, "memory.peak");
> > > + if (peak < MB(29))
> > > + goto cleanup;
> > > +
> > > + peak = cg_read_long(memcg, "memory.swap.peak");
> > > + if (peak < MB(29))
> > > + goto cleanup;
> > > +
> > > + if (cg_write(memcg, "memory.swap.peak", "\n"))
> > > + goto cleanup;
> > > +
> > > + if (cg_read_long(memcg, "memory.swap.peak") > MB(10))
> > > + goto cleanup;
> > > +
> > > +
> > > + if (cg_write(memcg, "memory.peak", "\n"))
> > > + goto cleanup;
> > > +
> > > + if (cg_read_long(memcg, "memory.peak"))
> > > + goto cleanup;
> > > +
> > > if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
> > > goto cleanup;
> > >
> > > @@ -869,6 +917,14 @@ static int test_memcg_swap_max(const char *root)
> > > if (max <= 0)
> > > goto cleanup;
> > >
> > > + peak = cg_read_long(memcg, "memory.peak");
> > > + if (peak < MB(29))
> > > + goto cleanup;
> > > +
> > > + peak = cg_read_long(memcg, "memory.swap.peak");
> > > + if (peak < MB(19))
> > > + goto cleanup;
> > > +
> > > ret = KSFT_PASS;
> > >
> > > cleanup:
> > > @@ -1295,7 +1351,7 @@ struct memcg_test {
> > > const char *name;
> > > } tests[] = {
> > > T(test_memcg_subtree_control),
> > > - T(test_memcg_current),
> > > + T(test_memcg_current_peak),
> > > T(test_memcg_min),
> > > T(test_memcg_low),
> > > T(test_memcg_high),
> > > @@ -1303,7 +1359,7 @@ struct memcg_test {
> > > T(test_memcg_max),
> > > T(test_memcg_reclaim),
> > > T(test_memcg_oom_events),
> > > - T(test_memcg_swap_max),
> > > + T(test_memcg_swap_max_peak),
> > > T(test_memcg_sock),
> > > T(test_memcg_oom_group_leaf_events),
> > > T(test_memcg_oom_group_parent_events),
> > > --
> > > 2.40.1
> > >
> >
> >
> > --
> > David Finkel
> > Senior Principal Software Engineer, Core Services
--
Michal Hocko
SUSE Labs