From: Waiman Long <longman@redhat.com>
To: David Finkel <davidf@vimeo.com>, Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>, Michal Hocko <mhocko@suse.com>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
core-services@vimeo.com, Jonathan Corbet <corbet@lwn.net>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shuah Khan <shuah@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Shakeel Butt <shakeel.butt@linux.dev>
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
Date: Wed, 17 Jul 2024 19:48:40 -0400
Message-ID: <85a67b00-9ae7-42a1-87e0-19b5563b9a0f@redhat.com>
In-Reply-To: <CAFUnj5OGJtR0wqOZVUh8QQ3gaw4gmatsEN1LcBdcwN_wx-LUug@mail.gmail.com>
On 7/17/24 17:13, David Finkel wrote:
> On Wed, Jul 17, 2024 at 4:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>> On Wed, Jul 17, 2024 at 04:14:07PM -0400, David Finkel wrote:
>>> On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>> On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
>>>>> Hello,
>>>>>
>>>>> On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
>>>>> ...
>>>>>>> This behavior is particularly useful for work scheduling systems that
>>>>>>> need to track memory usage of worker processes/cgroups per-work-item.
>>>>>>> Since memory can't be squeezed like CPU can (the OOM-killer has
>>>>>>> opinions), these systems need to track the peak memory usage to compute
>>>>>>> system/container fullness when bin-packing work items.
>>>>> Swap still has a bad rep, but there's nothing drastically worse about it
>>>>> than page cache, i.e. if you're under memory pressure, you get thrashing
>>>>> one way or another. If there's no swap, the system is just memlocking anon
>>>>> memory even when it is a lot colder than page cache, so I'm skeptical that
>>>>> no swap + mostly anon + kernel OOM kills is a good strategy in general,
>>>>> especially given that system behavior is not very predictable under OOM
>>>>> conditions.
>>>>>
>>>>>> As mentioned further down the email thread, I consider the usefulness of
>>>>>> the peak value rather limited. It is misleading when memory is reclaimed.
>>>>>> But fundamentally I do not oppose unifying the write behavior to reset
>>>>>> values.
>>>>> The removal of resets was intentional. The problem was that it wasn't clear
>>>>> who owned those counters and there was no way of telling who reset what,
>>>>> when. It was easy to accidentally end up with multiple entities each
>>>>> thinking they could get a timed measurement by resetting.
>>>>>
>>>>> So, in general, I don't think this is a great idea. There are shortcomings
>>>>> to how memory.peak behaves in that its meaningfulness quickly declines over
>>>>> time. This is expected and the rationale behind adding memory.peak, IIRC,
>>>>> was that it was difficult to tell the memory usage of a short-lived cgroup.
>>>>>
>>>>> If we want to allow peak measurement of time periods, I wonder whether we
>>>>> could do something similar to pressure triggers - ie. let users register
>>>>> watchers so that each user can define their own watch periods. This is more
>>>>> involved but more useful and less error-inducing than adding reset to a
>>>>> single counter.
>>>>>
>>>>> Johannes, what do you think?
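For reference, the existing PSI trigger API that this suggestion is modeled on works roughly as sketched below; applying the same per-fd registration pattern to peak watching is hypothetical, and the cgroup path is illustrative:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Wake me when "some" memory stall exceeds 150ms in any 1s window. */
    const char trig[] = "some 150000 1000000";
    struct pollfd fds;

    /* Illustrative path; each cgroup2 directory exposes memory.pressure. */
    fds.fd = open("/sys/fs/cgroup/workers/job0/memory.pressure",
                  O_RDWR | O_NONBLOCK);
    if (fds.fd < 0)
        return 1;
    if (write(fds.fd, trig, strlen(trig) + 1) < 0)
        return 1;

    fds.events = POLLPRI;
    while (poll(&fds, 1, -1) > 0) {
        if (fds.revents & POLLERR)
            break;          /* the monitored cgroup went away */
        if (fds.revents & POLLPRI)
            printf("memory pressure threshold crossed\n");
    }
    close(fds.fd);
    return 0;
}

Each trigger is owned by the fd that registered it, which is exactly the ownership property a per-fd peak watcher would inherit: nobody can clobber anybody else's measurement window.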
>>>> I'm also not a fan of the ability to reset globally.
>>>>
>>>> I seem to remember a scheme we discussed some time ago to do local
>>>> state tracking without adding overhead in the page counter
>>>> fastpath. The new data that needs to be tracked is a pc->local_peak
>>>> (in the page_counter) and an fd->peak (in the watcher's file state).
>>>>
>>>> 1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
>>>>
>>>> 2. Somebody opens memory.peak. Initialize fd->peak = -1.
>>>>
>>>> 3. If they write, set fd->peak = pc->local_peak = usage.
>>>>
>>>> 4. Usage grows.
>>>>
>>>> 5. They read(). A conventional reader has fd->peak == -1, so we return
>>>> pc->watermark. If the fd has been written to, return max(fd->peak, pc->local_peak).
>>>>
>>>> 6. Usage drops.
>>>>
>>>> 7. New watcher opens and writes. Bring up all existing watchers'
>>>> fd->peak (that aren't -1) to pc->local_peak *iff* the latter is bigger.
>>>> Then set the new fd->peak = pc->local_peak = current usage as in 3.
>>>>
>>>> 8. See 5. again for read() from each watcher.
>>>>
>>>> This way all fd's can arbitrarily start tracking new local peaks with
>>>> write(). The operation in the charging fast path is cheap. The write()
>>>> is O(existing_watchers), which seems reasonable. It's fully backward
>>>> compatible with conventional open() + read() users.
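For concreteness, a minimal C sketch of steps 1-8 above; the struct and function names are hypothetical and all locking is omitted:

#include <linux/kernel.h>
#include <linux/list.h>

struct peak_counter {
    unsigned long usage;
    unsigned long watermark;   /* lifetime peak, never reset (step 1) */
    unsigned long local_peak;  /* shared floor for fd-local peaks */
    struct list_head watchers;
};

struct peak_watcher {
    struct list_head node;
    unsigned long peak;        /* -1UL marks a conventional reader */
};

/* Step 1: the charge path raises both peaks; two compares on the fast path. */
static void peak_charge(struct peak_counter *pc, unsigned long usage)
{
    pc->usage = usage;
    if (usage > pc->watermark)
        pc->watermark = usage;
    if (usage > pc->local_peak)
        pc->local_peak = usage;
}

/* Step 2: a fresh open() starts out as a conventional reader. */
static void peak_open(struct peak_counter *pc, struct peak_watcher *w)
{
    w->peak = -1UL;
    list_add(&w->node, &pc->watchers);
}

/* Steps 3 and 7: write() starts a new fd-local tracking window. */
static void peak_write(struct peak_counter *pc, struct peak_watcher *self)
{
    struct peak_watcher *w;

    /* Preserve the outgoing window's peak for every active watcher... */
    list_for_each_entry(w, &pc->watchers, node)
        if (w->peak != -1UL && pc->local_peak > w->peak)
            w->peak = pc->local_peak;

    /* ...then restart local tracking from the current usage. */
    self->peak = pc->local_peak = pc->usage;
}

/* Steps 5 and 8: read() stays backward compatible with plain readers. */
static unsigned long peak_read(struct peak_counter *pc, struct peak_watcher *w)
{
    if (w->peak == -1UL)
        return pc->watermark;
    return max(w->peak, pc->local_peak);
}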
>>> That scheme seems viable, but it's a lot more work to implement and maintain
>>> than a simple global reset.
>>>
>>> Since that scheme maintains a separate pc->local_peak, it's not mutually
>>> exclusive with implementing a global reset now. (as long as we reserve a
>>> way to distinguish the different kinds of writes).
>>>
>>> As discussed on other sub-threads, this might be too niche to be worth
>>> the significant complexity of avoiding a global reset (especially since
>>> users would likely be moving from cgroups v1, which does have a global reset).
>> The problem is that once global resetting is allowed, it makes the
>> number reported in memory.peak unreliable for everyone. You just don't
>> know, and can't tell, if somebody wrote to it recently. It's not too
>> much of a leap to say this breaks the existing interface contract.
> It does make it hard to tell when it was reset; however, it also allows some
> very powerful command-line interactions that aren't possible if you need to
> keep a persistent fd open.
>
> I have run things in cgroups to measure peak memory and CPU-time for
> things that have subprocesses. If I needed to keep a persistent fd open
> in order to reset the high watermark, it would have been far less useful.
>
> Honestly, I don't see a ton of value in tracking the peak memory if I
> can't reset it.
> It's not my use-case, but there are a lot of cases where process startup uses
> a lot more memory than the steady state, so the sysadmin might want to
> measure that startup peak and any later peaks separately.
>
> In my use-case, I do have a long-lived process managing the cgroups
> for its workers, so I could keep an fd around and reset it as necessary.
> However, I do sometimes shell into the relevant k8s container and poke
> at the cgroups interactively, and having to dup that managing process's
> FD somehow to check the high watermark while debugging would be
> rather annoying (although definitely not a dealbreaker).
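As a sketch, the persistent-fd workflow under that scheme might look like this from userspace (the cgroup path and reset string are illustrative, not a settled ABI):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[32];
    ssize_t n;
    /* Hypothetical path; the manager holds this fd for the worker's lifetime. */
    int fd = open("/sys/fs/cgroup/workers/job0/memory.peak", O_RDWR);

    if (fd < 0)
        return 1;

    /* Writing (re)starts this fd's local tracking window. */
    if (write(fd, "reset\n", 6) < 0)
        return 1;

    /* ... dispatch a work item into the cgroup and wait for it ... */

    /* Read from offset 0 to get the peak observed during the window. */
    n = pread(fd, buf, sizeof(buf) - 1, 0);
    if (n <= 0)
        return 1;
    buf[n] = '\0';
    printf("peak for this work item: %s", buf);
    close(fd);
    return 0;
}

A debugging shell would still see the lifetime watermark with a plain read, since an fd that is never written to keeps the conventional semantics.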
>
>> You have to decide whether the above is worth implementing. But my
>> take is that the downsides of the simpler solution outweigh its
>> benefits.
> There are a few reasons for my reluctance to implement something
> more complicated:
> 1) Correctly cleaning up when one of those FDs gets closed can
> be subtle.
> 2) It's a lot of code, in some very sensitive portions of the kernel,
> so I'd need to test that code a lot more than I would for slapping
> a new entry point onto the existing watermark reset of the
> page_counter.
> 3) For various reasons, the relevant workload runs on
> Google Kubernetes Engine with their Container Optimised OS.
> If the patch is simple enough, I can request that Google
> cherry-pick the relevant commit, so we don't have to wait
> over a year for the next LTS kernel to roll out before we
> can switch to cgroups v2.
>
> It would be a nice personal challenge to implement the solution
> you suggest, but it's definitely not something I'd knock out in the
> next couple days.
How about letting .peak show two numbers? The first one is the peak
since the creation of the cgroup and cannot be reset. The second one is
a local maximum that can be reset to 0. We would just need to keep track
of one more counter, which should be simple enough to implement.
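A rough sketch of that idea (hypothetical names, kernel-side, locking omitted):

struct peak_pair {
    unsigned long watermark;        /* since cgroup creation, never reset */
    unsigned long local_watermark;  /* resettable high-water mark */
};

/* Charging raises both; still just two compares on the hot path. */
static void peak_pair_charge(struct peak_pair *p, unsigned long usage)
{
    if (usage > p->watermark)
        p->watermark = usage;
    if (usage > p->local_watermark)
        p->local_watermark = usage;
}

/* A write to .peak would clear only the second number... */
static void peak_pair_reset(struct peak_pair *p)
{
    p->local_watermark = 0;
}

/* ...and a read would then show both, e.g. "134217728 4096\n". */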
Cheers,
Longman