From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 17 Jul 2024 16:44:53 -0400
From: Johannes Weiner
To: David Finkel
Cc: Tejun Heo, Michal Hocko, Muchun Song, Andrew Morton,
	core-services@vimeo.com, Jonathan Corbet, Roman Gushchin,
	Shuah Khan, Zefan Li, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-mm@kvack.org,
	linux-kselftest@vger.kernel.org, Shakeel Butt
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
Message-ID: <20240717204453.GD1321673@cmpxchg.org>
References: <20240715203625.1462309-1-davidf@vimeo.com>
	<20240715203625.1462309-2-davidf@vimeo.com>
	<20240717170408.GC1321673@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:

On Wed, Jul 17, 2024 at 04:14:07PM -0400, David Finkel wrote:
> On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner wrote:
> >
> > On Tue, Jul 16, 2024 at
06:44:11AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
> > > ...
> > > > > This behavior is particularly useful for work scheduling systems that
> > > > > need to track memory usage of worker processes/cgroups per-work-item.
> > > > > Since memory can't be squeezed like CPU can (the OOM-killer has
> > > > > opinions), these systems need to track the peak memory usage to compute
> > > > > system/container fullness when binpacking workitems.
> > >
> > > Swap still has bad reps but there's nothing drastically worse about it than
> > > page cache. ie. If you're under memory pressure, you get thrashing one way
> > > or another. If there's no swap, the system is just memlocking anon memory
> > > even when they are a lot colder than page cache, so I'm skeptical that no
> > > swap + mostly anon + kernel OOM kills is a good strategy in general
> > > especially given that the system behavior is not very predictable under OOM
> > > conditions.
> > >
> > > > As mentioned down the email thread, I consider usefulness of peak value
> > > > rather limited. It is misleading when memory is reclaimed. But
> > > > fundamentally I do not oppose to unifying the write behavior to reset
> > > > values.
> > >
> > > The removal of resets was intentional. The problem was that it wasn't clear
> > > who owned those counters and there's no way of telling who reset what when.
> > > It was easy to accidentally end up with multiple entities that think they
> > > can get timed measurement by resetting.
> > >
> > > So, in general, I don't think this is a great idea. There are shortcomings
> > > to how memory.peak behaves in that its meaningfulness quickly declines over
> > > time. This is expected and the rationale behind adding memory.peak, IIRC,
> > > was that it was difficult to tell the memory usage of a short-lived cgroup.
> > >
> > > If we want to allow peak measurement of time periods, I wonder whether we
> > > could do something similar to pressure triggers - ie. let users register
> > > watchers so that each user can define their own watch periods. This is more
> > > involved but more useful and less error-inducing than adding reset to a
> > > single counter.
> > >
> > > Johannes, what do you think?
> >
> > I'm also not a fan of the ability to reset globally.
> >
> > I seem to remember a scheme we discussed some time ago to do local
> > state tracking without having the overhead in the page counter
> > fastpath. The new data that needs to be tracked is a pc->local_peak
> > (in the page_counter) and an fd->peak (in the watcher's file state).
> >
> > 1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
> >
> > 2. Somebody opens the memory.peak. Initialize fd->peak = -1.
> >
> > 3. If they write, set fd->peak = pc->local_peak = usage.
> >
> > 4. Usage grows.
> >
> > 5. They read(). A conventional reader has fd->peak == -1, so we return
> >    pc->watermark. If the fd has been written to, return max(fd->peak, pc->local_peak).
> >
> > 6. Usage drops.
> >
> > 7. New watcher opens and writes. Bring up all existing watchers'
> >    fd->peak (that aren't -1) to pc->local_peak *iff* latter is bigger.
> >    Then set the new fd->peak = pc->local_peak = current usage as in 3.
> >
> > 8. See 5. again for read() from each watcher.
> >
> > This way all fd's can arbitrarily start tracking new local peaks with
> > write(). The operation in the charging fast path is cheap. The write()
> > is O(existing_watchers), which seems reasonable. It's fully backward
> > compatible with conventional open() + read() users.
>
> That scheme seems viable, but it's a lot more work to implement and maintain
> than a simple global reset.
>
> Since that scheme maintains a separate pc->local_peak, it's not mutually
> exclusive with implementing a global reset now. (as long as we reserve a
> way to distinguish the different kinds of writes).
>
> As discussed on other sub-threads, this might be too niche to be worth
> the significant complexity of avoiding a global reset. (especially when
> users would likely be moving from cgroups v1 which does have a global reset)

The problem is that once global resetting is allowed, it makes the
number reported in memory.peak unreliable for everyone. You just don't
know, and can't tell, if somebody wrote to it recently. It's not too
much of a leap to say this breaks the existing interface contract.

You have to decide whether the above is worth implementing. But my
take is that the downsides of the simpler solution outweigh its
benefits.
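
To make the quoted scheme (steps 1-8) a bit more concrete, here is a rough
userspace model of it. This is only a sketch, not kernel code: struct
pc_model, struct watcher and all the function names are made up for
illustration, the fixed-size watcher array stands in for whatever list a
real implementation would keep, and locking is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WATCHERS 8    /* arbitrary bound, only for this model */
#define PEAK_UNSET (-1L)  /* fd->peak of a plain open() + read() user */

/* Hypothetical per-open-file state: the "fd->peak" from the scheme. */
struct watcher {
	long peak;
};

/* Hypothetical stand-in for the page counter ("pc"). */
struct pc_model {
	long usage;
	long watermark;   /* all-time peak, never reset */
	long local_peak;  /* peak since the most recent write() by any fd */
	struct watcher *watchers[MAX_WATCHERS];
	size_t nr_watchers;
};

/* Steps 1 and 4: the charge path only gains one extra compare. */
static void pc_charge(struct pc_model *pc, long pages)
{
	pc->usage += pages;
	if (pc->usage > pc->watermark)
		pc->watermark = pc->usage;
	if (pc->usage > pc->local_peak)
		pc->local_peak = pc->usage;
}

/* Step 6: uncharging never touches either peak. */
static void pc_uncharge(struct pc_model *pc, long pages)
{
	assert(pages <= pc->usage);
	pc->usage -= pages;
}

/* Step 2: open() registers the watcher with fd->peak = -1. */
static void watcher_open(struct pc_model *pc, struct watcher *w)
{
	assert(pc->nr_watchers < MAX_WATCHERS);
	w->peak = PEAK_UNSET;
	pc->watchers[pc->nr_watchers++] = w;
}

/*
 * Steps 3 and 7: before restarting local_peak, fold its current value
 * into every watcher that is already tracking, so none of them loses a
 * peak it has seen. This loop is the O(existing_watchers) part.
 */
static void watcher_write(struct pc_model *pc, struct watcher *w)
{
	size_t i;

	for (i = 0; i < pc->nr_watchers; i++) {
		struct watcher *other = pc->watchers[i];

		if (other->peak != PEAK_UNSET && pc->local_peak > other->peak)
			other->peak = pc->local_peak;
	}
	w->peak = pc->local_peak = pc->usage;
}

/* Steps 5 and 8: unwritten fds keep the conventional behavior. */
static long watcher_read(const struct pc_model *pc, const struct watcher *w)
{
	if (w->peak == PEAK_UNSET)
		return pc->watermark;
	return w->peak > pc->local_peak ? w->peak : pc->local_peak;
}
```

In this model, a watcher that writes while usage is 40 and then sees it
climb to 70 keeps reading 70 even after a second watcher's write()
restarts local_peak at a lower usage, while an fd that never wrote still
reads the all-time watermark.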