From: David Finkel <davidf@vimeo.com>
Date: Wed, 17 Jul 2024 17:13:47 -0400
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
To: Johannes Weiner
Cc: Tejun Heo, Michal Hocko, Muchun Song, Andrew Morton,
 core-services@vimeo.com, Jonathan Corbet, Roman Gushchin, Shuah Khan,
 Zefan Li, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Shakeel Butt
In-Reply-To: <20240717204453.GD1321673@cmpxchg.org>
References: <20240715203625.1462309-1-davidf@vimeo.com>
 <20240715203625.1462309-2-davidf@vimeo.com>
 <20240717170408.GC1321673@cmpxchg.org>
 <20240717204453.GD1321673@cmpxchg.org>

On Wed, Jul 17, 2024 at 4:44 PM Johannes Weiner wrote:
>
> On Wed, Jul 17, 2024 at 04:14:07PM -0400, David Finkel wrote:
> > On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner wrote:
> > >
> > > On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
> > > > ...
> > > > > > This behavior is particularly useful for work scheduling systems that
> > > > > > need to track memory usage of worker processes/cgroups per-work-item.
> > > > > > Since memory can't be squeezed like CPU can (the OOM-killer has
> > > > > > opinions), these systems need to track the peak memory usage to compute
> > > > > > system/container fullness when binpacking workitems.
> > > >
> > > > Swap still has bad reps but there's nothing drastically worse about it than
> > > > page cache. ie. If you're under memory pressure, you get thrashing one way
> > > > or another. If there's no swap, the system is just memlocking anon memory
> > > > even when they are a lot colder than page cache, so I'm skeptical that no
> > > > swap + mostly anon + kernel OOM kills is a good strategy in general
> > > > especially given that the system behavior is not very predictable under OOM
> > > > conditions.
> > > >
> > > > > As mentioned down the email thread, I consider usefulness of peak value
> > > > > rather limited. It is misleading when memory is reclaimed. But
> > > > > fundamentally I do not oppose to unifying the write behavior to reset
> > > > > values.
> > > >
> > > > The removal of resets was intentional. The problem was that it wasn't clear
> > > > who owned those counters and there's no way of telling who reset what when.
> > > > It was easy to accidentally end up with multiple entities that think they
> > > > can get timed measurement by resetting.
> > > >
> > > > So, in general, I don't think this is a great idea. There are shortcomings
> > > > to how memory.peak behaves in that its meaningfulness quickly declines over
> > > > time. This is expected and the rationale behind adding memory.peak, IIRC,
> > > > was that it was difficult to tell the memory usage of a short-lived cgroup.
> > > >
> > > > If we want to allow peak measurement of time periods, I wonder whether we
> > > > could do something similar to pressure triggers - ie. let users register
> > > > watchers so that each user can define their own watch periods. This is more
> > > > involved but more useful and less error-inducing than adding reset to a
> > > > single counter.
> > > >
> > > > Johannes, what do you think?
> > >
> > > I'm also not a fan of the ability to reset globally.
> > >
> > > I seem to remember a scheme we discussed some time ago to do local
> > > state tracking without having the overhead in the page counter
> > > fastpath. The new data that needs to be tracked is a pc->local_peak
> > > (in the page_counter) and an fd->peak (in the watcher's file state).
> > >
> > > 1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
> > >
> > > 2. Somebody opens the memory.peak. Initialize fd->peak = -1.
> > >
> > > 3. If they write, set fd->peak = pc->local_peak = usage.
> > >
> > > 4. Usage grows.
> > >
> > > 5. They read(). A conventional reader has fd->peak == -1, so we return
> > >    pc->watermark. If the fd has been written to, return
> > >    max(fd->peak, pc->local_peak).
> > >
> > > 6. Usage drops.
> > >
> > > 7. New watcher opens and writes. Bring up all existing watchers'
> > >    fd->peak (that aren't -1) to pc->local_peak *iff* latter is bigger.
> > >    Then set the new fd->peak = pc->local_peak = current usage as in 3.
> > >
> > > 8. See 5. again for read() from each watcher.
> > >
> > > This way all fd's can arbitrarily start tracking new local peaks with
> > > write(). The operation in the charging fast path is cheap. The write()
> > > is O(existing_watchers), which seems reasonable. It's fully backward
> > > compatible with conventional open() + read() users.
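[A minimal sketch, for illustration only: the fd-local tracking in steps
1-8 above modeled as standalone C. This is not the proposed patch;
local_peak, the per-fd peak, and watermark come from the description
above, while pc_model, peak_fd, charge(), peak_write(), and peak_read()
are made-up names for the example. The design point is that the charging
fast path never walks the watcher list; the O(existing_watchers) loop
runs only in the rare write().]

/*
 * Illustrative model of the fd-local peak scheme (steps 1-8 above).
 * Not kernel code; plain C so the bookkeeping can be traced end to end.
 */
#include <stdio.h>

struct pc_model {
        long usage;
        long watermark;   /* global peak, unchanged behavior */
        long local_peak;  /* shared baseline for fd-local watchers (step 1) */
};

struct peak_fd {
        long peak;              /* -1 == conventional reader (step 2) */
        struct pc_model *pc;
        struct peak_fd *next;   /* list of open watchers */
};

/* Steps 1/4/6: the charge/uncharge path only maintains the two peaks. */
static void charge(struct pc_model *pc, long delta)
{
        pc->usage += delta;
        if (pc->usage > pc->watermark)
                pc->watermark = pc->usage;
        if (pc->usage > pc->local_peak)
                pc->local_peak = pc->usage;
}

/* Steps 3/7: write() rebases every existing watcher, then this fd. */
static void peak_write(struct peak_fd *fd, struct peak_fd *watchers)
{
        struct peak_fd *w;

        for (w = watchers; w; w = w->next)
                if (w->peak != -1 && w->peak < w->pc->local_peak)
                        w->peak = w->pc->local_peak;

        fd->peak = fd->pc->local_peak = fd->pc->usage;
}

/* Steps 5/8: read() returns the global or the fd-local peak. */
static long peak_read(struct peak_fd *fd)
{
        if (fd->peak == -1)
                return fd->pc->watermark;
        return fd->peak > fd->pc->local_peak ? fd->peak : fd->pc->local_peak;
}

int main(void)
{
        struct pc_model pc = { .usage = 100, .watermark = 100, .local_peak = 100 };
        struct peak_fd a = { .peak = -1, .pc = &pc, .next = NULL };
        struct peak_fd b = { .peak = -1, .pc = &pc, .next = &a };

        peak_write(&a, &b);   /* watcher A starts a local window */
        charge(&pc, 50);      /* usage spikes to 150 */
        charge(&pc, -80);     /* ... then drops to 70 */
        peak_write(&b, &b);   /* watcher B opens+writes; A is rebased to 150 */
        charge(&pc, 10);      /* usage climbs back to 80 */

        /* Prints: A sees 150, B sees 80, global watermark 150 */
        printf("A sees %ld, B sees %ld, global watermark %ld\n",
               peak_read(&a), peak_read(&b), pc.watermark);
        return 0;
}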
> >
> > That scheme seems viable, but it's a lot more work to implement and maintain
> > than a simple global reset.
> >
> > Since that scheme maintains a separate pc->local_peak, it's not mutually
> > exclusive with implementing a global reset now. (as long as we reserve a
> > way to distinguish the different kinds of writes).
> >
> > As discussed on other sub-threads, this might be too niche to be worth
> > the significant complexity of avoiding a global reset. (especially when
> > users would likely be moving from cgroups v1, which does have a global reset)
>
> The problem is that once global resetting is allowed, it makes the
> number reported in memory.peak unreliable for everyone. You just don't
> know, and can't tell, if somebody wrote to it recently. It's not too
> much of a leap to say this breaks the existing interface contract.

It does make it hard to tell when it was reset; however, it also allows some
very powerful command-line interactions that aren't possible if you need to
keep a persistent fd open.

I have run things in cgroups to measure peak memory and CPU time for
things that have subprocesses. If I needed to keep a persistent fd open
in order to reset the high watermark, it would have been far less useful.
Honestly, I don't see a ton of value in tracking the peak memory if I
can't reset it.

It's not my use-case, but there are a lot of cases where process startup
uses a lot more memory than the steady state, so the sysadmin might want
to measure that startup peak and any later peaks separately.

In my use-case, I do have a long-lived process managing the cgroups for
its workers, so I could keep an fd around and reset it as necessary.
However, I do sometimes shell into the relevant k8s container and poke at
the cgroups with a shell, and having to dup that managing process's FD
somehow to check the high watermark while debugging would be rather
annoying. (although definitely not a dealbreaker)

> You have to decide whether the above is worth implementing. But my
> take is that the downsides of the simpler solution outweigh its
> benefits.

There are a few parts to my reticence to implement something more
complicated:

1) Correctly cleaning up when one of those FDs gets closed can be subtle.

2) It's a lot of code, in some very sensitive portions of the kernel, so
   I'd need to test that code a lot more than I would for slapping a new
   entrypoint on the existing watermark reset of the page_counter.

3) For various reasons, the relevant workload runs on Google Kubernetes
   Engine with their Container-Optimized OS. If the patch is simple
   enough, I can request that Google cherry-pick the relevant commit, so
   we don't have to wait over a year for the next LTS kernel to roll out
   before we can switch to cgroups v2.

It would be a nice personal challenge to implement the solution you
suggest, but it's definitely not something I'd knock out in the next
couple of days.

Thanks,

-- 
David Finkel
Senior Principal Software Engineer, Core Services