From: David Finkel <davidf@vimeo.com>
To: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
core-services@vimeo.com, Jonathan Corbet <corbet@lwn.net>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeelb@google.com>,
Shuah Khan <shuah@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Zefan Li <lizefan.x@bytedance.com>,
cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
Date: Tue, 16 Jul 2024 16:18:23 -0400
Message-ID: <CAFUnj5NLTz4yQHpucvwgWqKgC2oeotHMC3h6QyS_XHD2O7wJTA@mail.gmail.com>
In-Reply-To: <ZpbOezMVYkYdQV_s@slm.duckdns.org>
On Tue, Jul 16, 2024 at 3:48 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 16, 2024 at 01:10:14PM -0400, David Finkel wrote:
> > > Swap still has bad reps but there's nothing drastically worse about it than
> > > page cache. ie. If you're under memory pressure, you get thrashing one way
> > > or another. If there's no swap, the system is just memlocking anon memory
> > > even when they are a lot colder than page cache, so I'm skeptical that no
> > > swap + mostly anon + kernel OOM kills is a good strategy in general
> > > especially given that the system behavior is not very predictable under OOM
> > > conditions.
> >
> > The reason we need peak memory information is to let us schedule work in a
> > way that we generally avoid OOM conditions. For the workloads I work on,
> > we generally have very little in the page-cache, since the data isn't
> > stored locally most of the time, but streamed from other storage/database
> > systems. For those cases, demand-paging will cause large variations in
> > servicing time, and we'd rather restart the process than have
> > unpredictable latency. The same is true for the batch/queue-work system I
> > wrote this patch to support. We keep very little data on the local disk,
> > so the page cache is relatively small.
>
> You can detect these conditions more reliably and *earlier* using PSI
> triggers with swap enabled than hard allocations and OOM kills. Then, you
> can take whatever decision you want to take including killing the job
> without worrying about the whole system severely suffering. You can even do
> things like freezing the cgroup and taking backtraces and collecting other
> debug info to better understand why the memory usage is blowing up.
>
> There are of course multiple ways to go about things but I think it's useful
> to note that hard alloc based on peak usage + OOM kills likely isn't the
> best way here.
To be clear, my goal with peak memory tracking is to bin-pack in a way
that I don't encounter OOMs. I'd prefer to have a bit of headroom and
avoid OOMs if I can.
PSI does seem like a wonderful tool, and I do intend to use it, but since
it's a reactive signal and doesn't provide absolute values for the total
memory usage that we'd need to figure out in our central scheduler which
work can cohabitate (and how many instances), it complements memory.peak
rather than replacing my need for it.
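
As a rough illustration of the bin-packing use described above (not code
from the patch or from any real scheduler), here is a minimal Python sketch.
It assumes each job runs in its own cgroup v2 directory exposing
memory.peak, and that a scheduler places jobs with a first-fit-decreasing
heuristic while reserving some headroom per host. The function names, the
headroom_bytes parameter, and the choice of heuristic are all hypothetical.

```python
from pathlib import Path


def read_peak_bytes(cgroup_dir):
    """Return the high watermark in <cgroup_dir>/memory.peak, in bytes."""
    return int(Path(cgroup_dir, "memory.peak").read_text().strip())


def pack_first_fit_decreasing(peaks, host_bytes, headroom_bytes):
    """Group jobs onto hosts so summed peaks fit under a per-host budget.

    peaks:          dict of job name -> observed peak usage in bytes
    host_bytes:     memory available to jobs on each host
    headroom_bytes: memory deliberately left unallocated on each host
                    (a hypothetical policy knob, not a kernel setting)
    """
    budget = host_bytes - headroom_bytes
    hosts = []  # each entry is [remaining_bytes, [job names]]
    # Place the largest jobs first; each goes on the first host with room.
    for job, peak in sorted(peaks.items(), key=lambda kv: kv[1], reverse=True):
        if peak > budget:
            raise ValueError(f"{job} peaks at {peak} bytes; budget is {budget}")
        for host in hosts:
            if host[0] >= peak:
                host[0] -= peak
                host[1].append(job)
                break
        else:
            # No existing host had room, so open a new one.
            hosts.append([budget - peak, [job]])
    return [jobs for _, jobs in hosts]
```

The point of the sketch is that the packing step needs absolute byte
counts per job, which is what memory.peak supplies and what a pressure
signal alone does not.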
FWIW, at the moment, we have some (partially broken) OOM-detection, which
it does make sense to swap out for PSI tracking/trigger-watching that takes
care of scaling down workers when there's resource pressure.
(Thanks for pointing out that PSI is generally a better signal than OOMs
for memory pressure.)
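
For readers unfamiliar with PSI triggers, the following Python sketch shows
the general shape of trigger-watching as documented for the kernel's PSI
interface: write "<some|full> <stall us> <window us>" to a pressure file,
then poll it for POLLPRI. The helper names are hypothetical, the path passed
in (for example a cgroup's memory.pressure file) is an assumption about the
deployment, and the window bounds checked here reflect my reading of the
kernel documentation rather than anything stated in this thread.

```python
import os
import select


def parse_psi_line(line):
    """Parse one PSI line such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    kind, *fields = line.split()
    values = dict(field.split("=", 1) for field in fields)
    # 'total' is a stall time in microseconds; the avg* fields are percentages.
    return kind, {k: int(v) if k == "total" else float(v) for k, v in values.items()}


def format_trigger(kind, stall_us, window_us):
    """Build the trigger string that is written to a pressure file."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # Assumed limits (500ms..10s window), per the kernel PSI documentation.
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must be positive and fit in the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(pressure_path, trigger, timeout_ms):
    """Register a trigger and block until it fires; True if it fired in time."""
    fd = os.open(pressure_path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The kernel documentation's example writes the string with its NUL.
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _, event in poller.poll(timeout_ms):
            if event & select.POLLERR:
                raise OSError("pressure source went away")
            if event & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)
```

A worker supervisor could call wait_for_pressure() in a loop and scale
down when it returns True. Running it requires a kernel with PSI enabled
and write permission on the pressure file.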
Thanks again,
>
> ...
> > I appreciate the ownership issues with the current resetting interface in
> > the other locations. However, this peak RSS data is not used by all that
> > many applications (as evidenced by the fact that the memory.peak file was
> > only added a bit over a year ago). I think there are enough cases where
> > ownership is enforced externally that mirroring the existing interface to
> > cgroup2 is sufficient.
>
> It's a fairly new addition and its utility is limited, so it's not that widely
> used. Adding reset makes it more useful but in a way which can be
> detrimental in the long term.
>
> > I do think a more stateful interface would be nice, but I don't know
> > whether I have enough knowledge of memcg to implement that in a reasonable
> > amount of time.
>
> Right, this probably isn't trivial.
>
> > Ownership aside, I think being able to reset the high watermark of a
> > process makes it significantly more useful. Creating new cgroups and
> > moving processes around is significantly heavier-weight.
>
> Yeah, the setup / teardown cost can be non-trivial for short lived cgroups.
> I agree that having some way of measuring peak in different time intervals
> can be useful.
>
> Thanks.
>
> --
> tejun
--
David Finkel
Senior Principal Software Engineer, Core Services