From: David Finkel <davidf@vimeo.com>
Date: Wed, 17 Jul 2024 17:13:47 -0400
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
To: Johannes Weiner
Cc: Tejun Heo, Michal Hocko, Muchun Song, Andrew Morton,
 core-services@vimeo.com, Jonathan Corbet, Roman Gushchin, Shuah Khan,
 Zefan Li, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Shakeel Butt
In-Reply-To: <20240717204453.GD1321673@cmpxchg.org>
References: <20240715203625.1462309-1-davidf@vimeo.com>
 <20240715203625.1462309-2-davidf@vimeo.com>
 <20240717170408.GC1321673@cmpxchg.org>
 <20240717204453.GD1321673@cmpxchg.org>

On Wed, Jul 17, 2024 at 4:44 PM Johannes Weiner wrote:
>
> On Wed, Jul 17, 2024 at 04:14:07PM -0400, David Finkel wrote:
> > On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner wrote:
> > >
> > > On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
> > > > ...
> > > > > > This behavior is particularly useful for work scheduling systems that
> > > > > > need to track memory usage of worker processes/cgroups per-work-item.
> > > > > > Since memory can't be squeezed like CPU can (the OOM-killer has
> > > > > > opinions), these systems need to track the peak memory usage to compute
> > > > > > system/container fullness when binpacking workitems.
> > > >
> > > > Swap still has bad reps but there's nothing drastically worse about it than
> > > > page cache. ie. If you're under memory pressure, you get thrashing one way
> > > > or another. If there's no swap, the system is just memlocking anon memory
> > > > even when they are a lot colder than page cache, so I'm skeptical that no
> > > > swap + mostly anon + kernel OOM kills is a good strategy in general
> > > > especially given that the system behavior is not very predictable under OOM
> > > > conditions.
> > > >
> > > > > As mentioned down the email thread, I consider usefulness of peak value
> > > > > rather limited. It is misleading when memory is reclaimed. But
> > > > > fundamentally I do not oppose to unifying the write behavior to reset
> > > > > values.
> > > >
> > > > The removal of resets was intentional. The problem was that it wasn't clear
> > > > who owned those counters and there's no way of telling who reset what when.
> > > > It was easy to accidentally end up with multiple entities that think they
> > > > can get timed measurement by resetting.
> > > >
> > > > So, in general, I don't think this is a great idea. There are shortcomings
> > > > to how memory.peak behaves in that its meaningfulness quickly declines over
> > > > time. This is expected and the rationale behind adding memory.peak, IIRC,
> > > > was that it was difficult to tell the memory usage of a short-lived cgroup.
> > > >
> > > > If we want to allow peak measurement of time periods, I wonder whether we
> > > > could do something similar to pressure triggers - ie. let users register
> > > > watchers so that each user can define their own watch periods. This is more
> > > > involved but more useful and less error-inducing than adding reset to a
> > > > single counter.
> > > >
> > > > Johannes, what do you think?
> > >
> > > I'm also not a fan of the ability to reset globally.
> > >
> > > I seem to remember a scheme we discussed some time ago to do local
> > > state tracking without having the overhead in the page counter
> > > fastpath. The new data that needs to be tracked is a pc->local_peak
> > > (in the page_counter) and an fd->peak (in the watcher's file state).
> > >
> > > 1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
> > >
> > > 2. Somebody opens the memory.peak. Initialize fd->peak = -1.
> > >
> > > 3. If they write, set fd->peak = pc->local_peak = usage.
> > >
> > > 4. Usage grows.
> > >
> > > 5. They read(). A conventional reader has fd->peak == -1, so we return
> > >    pc->watermark. If the fd has been written to, return
> > >    max(fd->peak, pc->local_peak).
> > >
> > > 6. Usage drops.
> > >
> > > 7. New watcher opens and writes. Bring up all existing watchers'
> > >    fd->peak (that aren't -1) to pc->local_peak *iff* latter is bigger.
> > >    Then set the new fd->peak = pc->local_peak = current usage as in 3.
> > >
> > > 8. See 5. again for read() from each watcher.
> > >
> > > This way all fd's can arbitrarily start tracking new local peaks with
> > > write(). The operation in the charging fast path is cheap. The write()
> > > is O(existing_watchers), which seems reasonable. It's fully backward
> > > compatible with conventional open() + read() users.
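[A minimal sketch, for illustration only: the fd-local tracking in steps
1-8 above modeled as standalone C. This is not the proposed patch;
local_peak, the per-fd peak, and watermark come from the description
above, while pc_model, peak_fd, charge(), peak_write(), and peak_read()
are made-up names for the example. The design point is that the charging
fast path never walks the watcher list; the O(existing_watchers) loop
runs only in the rare write().]

/*
 * Illustrative model of the fd-local peak scheme (steps 1-8 above).
 * Not kernel code; plain C so the bookkeeping can be traced end to end.
 */
#include <stdio.h>

struct pc_model {
        long usage;
        long watermark;   /* global peak, unchanged behavior */
        long local_peak;  /* shared baseline for fd-local watchers (step 1) */
};

struct peak_fd {
        long peak;              /* -1 == conventional reader (step 2) */
        struct pc_model *pc;
        struct peak_fd *next;   /* list of open watchers */
};

/* Steps 1/4/6: the charge/uncharge path only maintains the two peaks. */
static void charge(struct pc_model *pc, long delta)
{
        pc->usage += delta;
        if (pc->usage > pc->watermark)
                pc->watermark = pc->usage;
        if (pc->usage > pc->local_peak)
                pc->local_peak = pc->usage;
}

/* Steps 3/7: write() rebases every existing watcher, then this fd. */
static void peak_write(struct peak_fd *fd, struct peak_fd *watchers)
{
        struct peak_fd *w;

        for (w = watchers; w; w = w->next)
                if (w->peak != -1 && w->peak < w->pc->local_peak)
                        w->peak = w->pc->local_peak;

        fd->peak = fd->pc->local_peak = fd->pc->usage;
}

/* Steps 5/8: read() returns the global or the fd-local peak. */
static long peak_read(struct peak_fd *fd)
{
        if (fd->peak == -1)
                return fd->pc->watermark;
        return fd->peak > fd->pc->local_peak ? fd->peak : fd->pc->local_peak;
}

int main(void)
{
        struct pc_model pc = { .usage = 100, .watermark = 100, .local_peak = 100 };
        struct peak_fd a = { .peak = -1, .pc = &pc, .next = NULL };
        struct peak_fd b = { .peak = -1, .pc = &pc, .next = &a };

        peak_write(&a, &b);   /* watcher A starts a local window */
        charge(&pc, 50);      /* usage spikes to 150 */
        charge(&pc, -80);     /* ... then drops to 70 */
        peak_write(&b, &b);   /* watcher B opens+writes; A is rebased to 150 */
        charge(&pc, 10);      /* usage climbs back to 80 */

        /* Prints: A sees 150, B sees 80, global watermark 150 */
        printf("A sees %ld, B sees %ld, global watermark %ld\n",
               peak_read(&a), peak_read(&b), pc.watermark);
        return 0;
}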
> >
> > That scheme seems viable, but it's a lot more work to implement and maintain
> > than a simple global reset.
> >
> > Since that scheme maintains a separate pc->local_peak, it's not mutually
> > exclusive with implementing a global reset now. (as long as we reserve a
> > way to distinguish the different kinds of writes).
> >
> > As discussed on other sub-threads, this might be too niche to be worth
> > the significant complexity of avoiding a global reset. (especially when
> > users would likely be moving from cgroups v1, which does have a global reset)
>
> The problem is that once global resetting is allowed, it makes the
> number reported in memory.peak unreliable for everyone. You just don't
> know, and can't tell, if somebody wrote to it recently. It's not too
> much of a leap to say this breaks the existing interface contract.

It does make it hard to tell when it was reset; however, it also allows some
very powerful command-line interactions that aren't possible if you need to
keep a persistent fd open.

I have run things in cgroups to measure peak memory and CPU time for
things that have subprocesses. If I needed to keep a persistent fd open
in order to reset the high watermark, it would have been far less useful.
Honestly, I don't see a ton of value in tracking the peak memory if I
can't reset it.

It's not my use-case, but there are a lot of cases where process startup
uses a lot more memory than the steady state, so the sysadmin might want
to measure that startup peak and any later peaks separately.

In my use-case, I do have a long-lived process managing the cgroups for
its workers, so I could keep an fd around and reset it as necessary.
However, I do sometimes shell into the relevant k8s container and poke at
the cgroups with a shell, and having to dup that managing process's FD
somehow to check the high watermark while debugging would be rather
annoying. (although definitely not a dealbreaker)

> You have to decide whether the above is worth implementing. But my
> take is that the downsides of the simpler solution outweigh its
> benefits.

There are a few parts to my reticence to implement something more
complicated:

1) Correctly cleaning up when one of those FDs gets closed can be subtle.

2) It's a lot of code, in some very sensitive portions of the kernel, so
   I'd need to test that code a lot more than I would for slapping a new
   entrypoint on the existing watermark reset of the page_counter.

3) For various reasons, the relevant workload runs on Google Kubernetes
   Engine with their Container-Optimized OS. If the patch is simple
   enough, I can request that Google cherry-pick the relevant commit, so
   we don't have to wait over a year for the next LTS kernel to roll out
   before we can switch to cgroups v2.

It would be a nice personal challenge to implement the solution you
suggest, but it's definitely not something I'd knock out in the next
couple of days.

Thanks,

-- 
David Finkel
Senior Principal Software Engineer, Core Services