From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 3 Feb 2023 07:28:49 -0800
Subject: Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
To: Johannes Weiner
Cc: Dave Chinner, Alexander Viro, "Darrick J. Wong", Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>, "Matthew Wilcox (Oracle)",
	Miaohe Lin, David Hildenbrand, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-mm@kvack.org
References: <20230202233229.3895713-1-yosryahmed@google.com> <20230203000057.GS360264@dread.disaster.area>
On Fri, Feb 3, 2023 at 7:11 AM Johannes Weiner wrote:
>
> On Thu, Feb 02, 2023 at 04:17:18PM -0800, Yosry Ahmed wrote:
> > On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner wrote:
> > > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > > describing its true purpose.
> > > >
> > > > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> > > >
> > > > The original draft was a little bit different. It also kept track of
> > > > uncharged objcg pages, and reported them only in memcg reclaim and only
> > > > if the uncharged memcg is in the subtree of the memcg under reclaim.
> > > > This was an attempt to make reporting of memcg reclaim even more
> > > > accurate, but was dropped due to questionable complexity vs benefit
> > > > tradeoff. It can be revived if there is interest.
> > > >
> > > > Yosry Ahmed (2):
> > > >   mm: vmscan: refactor updating reclaimed pages in reclaim_state
> > > >   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> > > >
> > > >  fs/inode.c | 3 +--
> > >
> > > Inodes and inode mapping pages are directly charged to the memcg
> > > that allocated them and the shrinker is correctly marked as
> > > SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
> > > account them correctly to the related memcg, regardless of which
> > > memcg is triggering the reclaim. Hence I'm not sure that skipping
> > > the accounting of the reclaimed memory is even correct in this case;
> >
> > Please note that we are not skipping any accounting here. The pages
> > are still uncharged from the memcgs they are charged to (the allocator
> > memcgs, as you pointed out). We just do not report them in the return
> > value of try_to_free_mem_cgroup_pages(), to avoid over-reporting.
>
> I was wondering the same thing as Dave, reading through this. But
> you're right, we'll catch the accounting during uncharge. Can you
> please add a comment on the !cgroup_reclaim() explaining this?

Sure! If we settle on this implementation I will send another version
with a comment and fix the build problem in patch 2.

> There is one wrinkle with this, though.
> We have the following (simplified) sequence during charging:
>
>	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>						    gfp_mask, reclaim_options);
>
>	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>		goto retry;
>
>	/*
>	 * Even though the limit is exceeded at this point, reclaim
>	 * may have been able to free some pages. Retry the charge
>	 * before killing the task.
>	 *
>	 * Only for regular pages, though: huge pages are rather
>	 * unlikely to succeed so close to the limit, and we fall back
>	 * to regular pages anyway in case of failure.
>	 */
>	if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
>		goto retry;
>
> So in the unlikely scenario where the first call doesn't make the
> necessary headroom, and the shrinkers are the only thing that made
> forward progress, we would OOM prematurely.
>
> Not that an OOM would seem that far away in that scenario, anyway. But I
> remember long discussions with DavidR on probabilistic OOM regressions ;)

Above the if (nr_reclaimed...) check we have:

	if (gfp_mask & __GFP_NORETRY)
		goto nomem;

and below it we have:

	if (nr_retries--)
		goto retry;

So IIUC we only prematurely OOM if we either have __GFP_NORETRY and
cannot reclaim any LRU pages on the first try, or if the scenario where
only the shrinkers reclaim successfully happens on the last retry.
Right?

> > > I think the code should still be accounting for all pages that
> > > belong to the memcg being scanned that are reclaimed, not ignoring
> > > them altogether...
> >
> > 100% agree. Ideally I would want to:
> > - For pruned inodes: report all freed pages for global reclaim, and
> >   only report pages charged to the memcg under reclaim for memcg
> >   reclaim.
>
> This only happens on highmem systems at this point, as elsewhere
> populated inodes aren't on the shrinker LRUs anymore.
> We'd probably be ok with a comment noting the inaccuracy in the
> proactive reclaim stats for the time being, until somebody actually
> cares about that combination.

Interesting, I did not realize this. I guess in this case we may get
away with just ignoring non-LRU reclaimed pages in memcg reclaim
completely, or go the extra bit and report uncharged objcg pages in
memcg reclaim. See below.

> > - For slab: report all freed pages for global reclaim, and only report
> >   uncharged objcg pages from the memcg under reclaim for memcg reclaim.
> >
> > The only problem is that I thought people would think this is too much
> > complexity and not worth it. If people agree this should be the
> > approach to follow, I can prepare patches for this. I originally
> > implemented this for slab pages, but held off on sending it.
>
> I'd be curious to see the code!

I think it is small enough to paste here. Basically, instead of just
ignoring reclaim_state->reclaimed completely in patch 2, I counted
uncharged objcg pages only in memcg reclaim instead of freed slab
pages, and ignored pruned inode pages in memcg reclaim.

So I guess we can go with one of:
- Just ignore freed slab pages and pages from pruned inodes in memcg
  reclaim (current RFC).
- Ignore pruned inodes in memcg reclaim (as you explained above), and
  use the following diff instead of patch 2 for slab.
- Use the following diff for slab AND properly report freed pages from
  pruned inodes if they are relevant to the memcg under reclaim.

Let me know what you think is best.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc1d8b326453..37f799901dfb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -162,6 +162,7 @@ struct reclaim_state {
 };

 void report_freed_pages(unsigned long pages);
+bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg);

 #ifdef __KERNEL__
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..a886ace70648 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,6 +3080,13 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 	memcg_account_kmem(memcg, -nr_pages);
 	refill_stock(memcg, nr_pages);

+	/*
+	 * If undergoing memcg reclaim, report uncharged pages and drain local
+	 * stock to update the memcg usage.
+	 */
+	if (report_uncharged_pages(nr_pages, memcg))
+		drain_local_stock(NULL);
+
 	css_put(&memcg->css);
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 207998b16e5f..d4eced2b884b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,17 +204,54 @@ static void set_task_reclaim_state(struct task_struct *task,
 	task->reclaim_state = rs;
 }

+static bool cgroup_reclaim(struct scan_control *sc);
+
 /*
  * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim
  * @pages: number of pages freed
  *
- * If the current process is undergoing a reclaim operation,
+ * If the current process is undergoing a non-cgroup reclaim operation,
  * increment the number of reclaimed pages by @pages.
  */
 void report_freed_pages(unsigned long pages)
 {
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed += pages;
+	struct reclaim_state *rs = current->reclaim_state;
+	struct scan_control *sc;
+
+	if (!rs)
+		return;
+
+	sc = container_of(rs, struct scan_control, reclaim_state);
+	if (!cgroup_reclaim(sc))
+		rs->reclaimed += pages;
+}
+
+/*
+ * report_uncharged_pages: report pages uncharged outside of LRU-based reclaim
+ * @pages: number of pages uncharged
+ * @memcg: memcg pages were uncharged from
+ *
+ * If the current process is undergoing a cgroup reclaim operation, increment
+ * the number of reclaimed pages by @pages, if the memcg under reclaim is @memcg
+ * or an ancestor of it.
+ *
+ * Returns true if an update was made.
+ */
+bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg)
+{
+	struct reclaim_state *rs = current->reclaim_state;
+	struct scan_control *sc;
+
+	if (!rs)
+		return false;
+
+	sc = container_of(rs, struct scan_control, reclaim_state);
+	if (cgroup_reclaim(sc) &&
+	    mem_cgroup_is_descendant(memcg, sc->target_mem_cgroup)) {
+		rs->reclaimed += pages;
+		return true;
+	}
+	return false;
+}

 LIST_HEAD(shrinker_list);