From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Finkel <davidf@vimeo.com>
Date: Thu, 18 Jul 2024 17:49:28 -0400
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
To: Johannes Weiner
Cc: Tejun Heo, Michal Hocko, Muchun Song, Andrew Morton, core-services@vimeo.com,
 Jonathan Corbet, Roman Gushchin, Shuah Khan, Zefan Li, cgroups@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
 Shakeel Butt
In-Reply-To: <20240717170408.GC1321673@cmpxchg.org>
References: <20240715203625.1462309-1-davidf@vimeo.com>
 <20240715203625.1462309-2-davidf@vimeo.com>
 <20240717170408.GC1321673@cmpxchg.org>
Content-Type: text/plain; charset="UTF-8"
On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner wrote:
>
> On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
> > ...
> > > > This behavior is particularly useful for work scheduling systems that
> > > > need to track memory usage of worker processes/cgroups per-work-item.
> > > > Since memory can't be squeezed like CPU can (the OOM-killer has
> > > > opinions), these systems need to track the peak memory usage to compute
> > > > system/container fullness when binpacking workitems.
> >
> > Swap still has a bad rep but there's nothing drastically worse about it than
> > page cache. ie. If you're under memory pressure, you get thrashing one way
> > or another. If there's no swap, the system is just memlocking anon memory
> > even when it is a lot colder than page cache, so I'm skeptical that no
> > swap + mostly anon + kernel OOM kills is a good strategy in general,
> > especially given that system behavior is not very predictable under OOM
> > conditions.
> >
> > > As mentioned down the email thread, I consider the usefulness of the peak
> > > value rather limited. It is misleading when memory is reclaimed. But
> > > fundamentally I do not oppose unifying the write behavior to reset
> > > values.
> >
> > The removal of resets was intentional. The problem was that it wasn't clear
> > who owned those counters and there was no way of telling who reset what when.
> > It was easy to accidentally end up with multiple entities that think they
> > can get timed measurements by resetting.
> >
> > So, in general, I don't think this is a great idea. There are shortcomings
> > to how memory.peak behaves in that its meaningfulness quickly declines over
> > time. This is expected, and the rationale behind adding memory.peak, IIRC,
> > was that it was difficult to tell the memory usage of a short-lived cgroup.
> >
> > If we want to allow peak measurement of time periods, I wonder whether we
> > could do something similar to pressure triggers - ie. let users register
> > watchers so that each user can define their own watch periods. This is more
> > involved but more useful and less error-inducing than adding reset to a
> > single counter.
> >
> > Johannes, what do you think?
>
> I'm also not a fan of the ability to reset globally.
>
> I seem to remember a scheme we discussed some time ago to do local
> state tracking without having the overhead in the page counter
> fastpath. The new data that needs to be tracked is a pc->local_peak
> (in the page_counter) and an fd->peak (in the watcher's file state).
>
> 1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
>
> 2. Somebody opens the memory.peak. Initialize fd->peak = -1.
>
> 3. If they write, set fd->peak = pc->local_peak = usage.
>
> 4. Usage grows.
>
> 5. They read(). A conventional reader has fd->peak == -1, so we return
>    pc->watermark. If the fd has been written to, return
>    max(fd->peak, pc->local_peak).
>
> 6. Usage drops.
>
> 7. New watcher opens and writes. Bring up all existing watchers'
>    fd->peak (that aren't -1) to pc->local_peak *iff* the latter is bigger.
>    Then set the new fd->peak = pc->local_peak = current usage, as in 3.
>
> 8. See 5. again for read() from each watcher.
>
> This way all fds can arbitrarily start tracking new local peaks with
> write(). The operation in the charging fast path is cheap. The write()
> is O(existing_watchers), which seems reasonable. It's fully backward
> compatible with conventional open() + read() users.

I spent some time today attempting to implement this.
Here's a branch on GitHub that compiles, and I think is close to what you
described, but is definitely still a WIP:

https://github.com/torvalds/linux/compare/master...dfinkel:linux:memcg2_memory_peak_fd_session

Since there seems to be significant agreement that this approach is better
long-term as a kernel interface, if that continues, I can factor out some of
the changes so it supports both memory.peak and memory.swap.peak, fix the
tests, and clean up any style issues tomorrow.
Also, if there are opinions on whether the cgroup_lock is a reasonable way
of handling this synchronization, or if I should add a more appropriate
spinlock or mutex onto either the page_counter or the memcg, I'm all ears.

(I can mail the WIP change as-is if that's preferred to GitHub)

--
David Finkel
Senior Principal Software Engineer, Core Services