Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
From: Waiman Long <longman@redhat.com>
To: David Finkel, Johannes Weiner
Cc: Tejun Heo, Michal Hocko, Muchun Song, Andrew Morton,
 core-services@vimeo.com, Jonathan Corbet, Roman Gushchin, Shuah Khan,
 Zefan Li, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Shakeel Butt
Date: Wed, 17 Jul 2024 19:48:40 -0400
Message-ID: <85a67b00-9ae7-42a1-87e0-19b5563b9a0f@redhat.com>
References: <20240715203625.1462309-1-davidf@vimeo.com>
 <20240715203625.1462309-2-davidf@vimeo.com>
 <20240717170408.GC1321673@cmpxchg.org>
 <20240717204453.GD1321673@cmpxchg.org>

On 7/17/24 17:13, David Finkel wrote:
> On Wed, Jul 17, 2024 at 4:44 PM Johannes Weiner wrote:
>> On Wed, Jul 17, 2024 at 04:14:07PM -0400, David Finkel wrote:
>>> On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner wrote:
>>>> On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
>>>>> Hello,
>>>>>
>>>>> On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
>>>>> ...
>>>>>>> This behavior is particularly useful for work scheduling systems
>>>>>>> that need to track the memory usage of worker processes/cgroups
>>>>>>> per work item. Since memory can't be squeezed like CPU can (the
>>>>>>> OOM-killer has opinions), these systems need to track the peak
>>>>>>> memory usage to compute system/container fullness when
>>>>>>> binpacking work items.
>>>>>
>>>>> Swap still has a bad rep, but there's nothing drastically worse
>>>>> about it than page cache; i.e., if you're under memory pressure,
>>>>> you get thrashing one way or another. If there's no swap, the
>>>>> system is just memlocking anon memory even when it is a lot colder
>>>>> than page cache, so I'm skeptical that no swap + mostly anon +
>>>>> kernel OOM kills is a good strategy in general, especially given
>>>>> that system behavior is not very predictable under OOM conditions.
>>>>>
>>>>>> As mentioned down the email thread, I consider the usefulness of
>>>>>> the peak value rather limited. It is misleading when memory is
>>>>>> reclaimed. But fundamentally I do not oppose unifying the write
>>>>>> behavior to reset values.
>>>>>
>>>>> The removal of resets was intentional. The problem was that it
>>>>> wasn't clear who owned those counters, and there was no way of
>>>>> telling who reset what when. It was easy to accidentally end up
>>>>> with multiple entities that think they can get timed measurements
>>>>> by resetting.
>>>>>
>>>>> So, in general, I don't think this is a great idea. There are
>>>>> shortcomings to how memory.peak behaves, in that its
>>>>> meaningfulness quickly declines over time. This is expected; the
>>>>> rationale behind adding memory.peak, IIRC, was that it was
>>>>> difficult to tell the memory usage of a short-lived cgroup.
>>>>>
>>>>> If we want to allow peak measurement over time periods, I wonder
>>>>> whether we could do something similar to pressure triggers - i.e.,
>>>>> let users register watchers so that each user can define their own
>>>>> watch periods. This is more involved but more useful and less
>>>>> error-inducing than adding a reset to a single counter.
>>>>>
>>>>> Johannes, what do you think?
>>>>
>>>> I'm also not a fan of the ability to reset globally.
>>>>
>>>> I seem to remember a scheme we discussed some time ago to do local
>>>> state tracking without adding overhead to the page counter fast
>>>> path. The new data that needs to be tracked is a pc->local_peak (in
>>>> the page_counter) and an fd->peak (in the watcher's file state).
>>>>
>>>> 1. The usage peak is tracked in pc->watermark, and now also in
>>>>    pc->local_peak.
>>>>
>>>> 2. Somebody opens memory.peak. Initialize fd->peak = -1.
>>>>
>>>> 3. If they write, set fd->peak = pc->local_peak = usage.
>>>>
>>>> 4. Usage grows.
>>>>
>>>> 5. They read(). A conventional reader has fd->peak == -1, so we
>>>>    return pc->watermark. If the fd has been written to, return
>>>>    max(fd->peak, pc->local_peak).
>>>>
>>>> 6. Usage drops.
>>>>
>>>> 7. A new watcher opens and writes. Bring up all existing watchers'
>>>>    fd->peak (those that aren't -1) to pc->local_peak *iff* the
>>>>    latter is bigger. Then set the new fd->peak = pc->local_peak =
>>>>    current usage, as in 3.
>>>>
>>>> 8. See 5. again for read() from each watcher.
>>>>
>>>> This way all fds can arbitrarily start tracking new local peaks
>>>> with write(). The operation in the charging fast path is cheap. The
>>>> write() is O(existing_watchers), which seems reasonable. It's fully
>>>> backward compatible with conventional open() + read() users.
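
For concreteness, the scheme above can be modeled in plain C roughly as
follows. This is only a userspace sketch of the idea, not the kernel
implementation, and all names (peak_counter, peak_watcher, set_usage,
...) are made up:

/*
 * Userspace model of the fd-local peak tracking scheme described
 * above. All names are hypothetical; this sketches the idea only.
 */
#include <stdint.h>
#include <stdio.h>

struct peak_counter {
        uint64_t usage;         /* current usage */
        uint64_t watermark;     /* all-time peak, never reset */
        uint64_t local_peak;    /* peak since the most recent write() */
};

struct peak_watcher {
        int64_t peak;           /* -1 until this fd write()s (step 2) */
};

/* Steps 1/4/6: the charging path just maintains two running maxima. */
static void set_usage(struct peak_counter *pc, uint64_t usage)
{
        pc->usage = usage;
        if (usage > pc->watermark)
                pc->watermark = usage;
        if (usage > pc->local_peak)
                pc->local_peak = usage;
}

/*
 * Steps 3/7: write() first folds pc->local_peak into every watcher
 * that has written before, so their view survives the reset, then
 * restarts the local peak from the current usage.
 */
static void peak_write(struct peak_counter *pc,
                       struct peak_watcher *w, int nr, int self)
{
        int i;

        for (i = 0; i < nr; i++)
                if (w[i].peak >= 0 && (uint64_t)w[i].peak < pc->local_peak)
                        w[i].peak = (int64_t)pc->local_peak;
        pc->local_peak = pc->usage;
        w[self].peak = (int64_t)pc->usage;
}

/* Step 5: conventional readers see the watermark, writers a local max. */
static uint64_t peak_read(struct peak_counter *pc, struct peak_watcher *w)
{
        if (w->peak < 0)
                return pc->watermark;
        return (uint64_t)w->peak > pc->local_peak ?
               (uint64_t)w->peak : pc->local_peak;
}

int main(void)
{
        struct peak_counter pc = { 0, 0, 0 };
        struct peak_watcher w[2] = { { -1 }, { -1 } }; /* step 2: open() */

        set_usage(&pc, 100);            /* usage grows */
        set_usage(&pc, 40);             /* usage drops */
        peak_write(&pc, w, 2, 0);       /* w[0] opens a window at 40 */
        set_usage(&pc, 70);             /* smaller, local peak */

        printf("w[0] (wrote): %llu\n",
               (unsigned long long)peak_read(&pc, &w[0])); /* 70 */
        printf("w[1] (conventional): %llu\n",
               (unsigned long long)peak_read(&pc, &w[1])); /* 100 */
        return 0;
}

The fold in peak_write() is what keeps each write O(existing_watchers)
while leaving the charging path with nothing but two max updates.
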
>>> That scheme seems viable, but it's a lot more work to implement and
>>> maintain than a simple global reset.
>>>
>>> Since that scheme maintains a separate pc->local_peak, it's not
>>> mutually exclusive with implementing a global reset now (as long as
>>> we reserve a way to distinguish the different kinds of writes).
>>>
>>> As discussed on other sub-threads, this might be too niche to be
>>> worth the significant complexity of avoiding a global reset
>>> (especially when users would likely be moving from cgroups v1, which
>>> does have a global reset).
>>
>> The problem is that once global resetting is allowed, it makes the
>> number reported in memory.peak unreliable for everyone. You just
>> don't know, and can't tell, if somebody wrote to it recently. It's
>> not too much of a leap to say this breaks the existing interface
>> contract.
>
> It does make it hard to tell when it was reset; however, it also
> allows some very powerful command-line interactions that aren't
> possible if you need to keep a persistent fd open.
>
> I have run things in cgroups to measure peak memory and CPU time for
> things that have subprocesses. If I had needed to keep a persistent fd
> open in order to reset the high watermark, it would have been far less
> useful.
>
> Honestly, I don't see a ton of value in tracking the peak memory if I
> can't reset it.
>
> It's not my use case, but there are a lot of cases where process
> startup uses a lot more memory than the steady state, so the sysadmin
> might want to measure that startup peak and any later peaks
> separately.
>
> In my use case, I do have a long-lived process managing the cgroups
> for its workers, so I could keep an fd around and reset it as
> necessary. However, I do sometimes shell into the relevant k8s
> container and poke at the cgroups with a shell, and having to dup that
> managing process's FD somehow to check the high watermark while
> debugging would be rather annoying (although definitely not a
> dealbreaker).
>
>> You have to decide whether the above is worth implementing. But my
>> take is that the downsides of the simpler solution outweigh its
>> benefits.
>
> There are a few parts to my reticence to implement something more
> complicated:
>
> 1) Correctly cleaning up when one of those FDs gets closed can be
>    subtle.
>
> 2) It's a lot of code in some very sensitive portions of the kernel,
>    so I'd need to test that code a lot more than I would for slapping
>    a new entry point on the existing watermark reset of the
>    page_counter.
>
> 3) For various reasons, the relevant workload runs on Google
>    Kubernetes Engine with their Container-Optimized OS. If the patch
>    is simple enough, I can request that Google cherry-pick the
>    relevant commit, so we don't have to wait over a year for the next
>    LTS kernel to roll out before we can switch to cgroups v2.
>
> It would be a nice personal challenge to implement the solution you
> suggest, but it's definitely not something I could knock out in the
> next couple of days.

How about letting .peak show two numbers? The first is the peak since
the creation of the cgroup, which cannot be reset. The second is a
local maximum that can be reset to 0. We would only need to keep track
of one more counter, which should be simple enough to implement.

Cheers,
Longman
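
P.S. A rough userspace model of this two-counter idea; the names and
the two-number output format are just assumptions for illustration, not
a proposed interface:

/*
 * Userspace model of the two-number proposal: the first value is the
 * peak since cgroup creation (never reset), the second a local
 * maximum that a write resets to 0.
 */
#include <stdint.h>
#include <stdio.h>

struct two_peaks {
        uint64_t usage;
        uint64_t watermark;             /* since creation, no reset */
        uint64_t local_watermark;       /* resettable local maximum */
};

static void tp_set_usage(struct two_peaks *tp, uint64_t usage)
{
        tp->usage = usage;
        if (usage > tp->watermark)
                tp->watermark = usage;
        if (usage > tp->local_watermark)
                tp->local_watermark = usage;
}

/* A write to .peak would only clear the local counter. */
static void tp_reset(struct two_peaks *tp)
{
        tp->local_watermark = 0;
}

/* A read of .peak might show "<since-creation> <since-last-reset>". */
static void tp_show(const struct two_peaks *tp)
{
        printf("%llu %llu\n",
               (unsigned long long)tp->watermark,
               (unsigned long long)tp->local_watermark);
}

int main(void)
{
        struct two_peaks tp = { 0, 0, 0 };

        tp_set_usage(&tp, 100);
        tp_set_usage(&tp, 40);
        tp_reset(&tp);          /* e.g. echo 0 > memory.peak */
        tp_set_usage(&tp, 70);
        tp_show(&tp);           /* prints "100 70" */
        return 0;
}

Compared with today's single watermark, the only additions are the
local_watermark field and the reset path; the charging path gains one
extra compare-and-store.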