From: Mina Almasry
Date: Wed, 7 Dec 2022 13:43:55 -0800
Subject: Re: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
To: Michal Hocko
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Huang Ying, Yang Shi, Yosry Ahmed, weixugc@google.com, fvdl@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20221206023406.3182800-1-almasrymina@google.com>

On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko wrote:
>
> On Tue 06-12-22 17:55:58, Mina Almasry wrote:
> > On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko wrote:
> > >
> > > On Tue 06-12-22 08:06:51, Mina Almasry wrote:
> > > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko wrote:
> > > > >
> > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> > > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > > > > > to do, however, it introduced a regression in the behavior of
> > > > > > try_to_free_mem_cgroup_pages().
> > > > > >
> > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > > > > > of the cgroup should reduce by nr_pages. The callers expect
> > > > > > try_to_free_mem_cgroup_pages() to also return the number of pages
> > > > > > reclaimed, not demoted.
> > > > > >
> > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it
> > > > > > unconditionally counts demoted pages as reclaimed pages. So in practice
> > > > > > when it is called it will often demote nr_pages and return the number
> > > > > > of demoted pages to the caller. Demoted pages don't lower the memcg
> > > > > > usage, and so try_to_free_mem_cgroup_pages() is not actually doing what
> > > > > > the callers want it to do.
> > > > > >
> > > > > > Various things work suboptimally on memory tiered systems or don't work
> > > > > > at all due to this:
> > > > > >
> > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> > > > > >   instead of lowering the memcg usage by nr_pages).
> > > > > > - try_charge_memcg() will keep retrying the charge while
> > > > > >   try_to_free_mem_cgroup_pages() is just demoting pages and not
> > > > > >   actually making any room for the charge.
> > > > >
> > > > > This has been brought up during the review
> > > > > https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
> > > > >
> > > >
> > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't
> > > > understand this bit from your email (sorry, I'm probably missing
> > > > something):
> > > >
> > > > "I suspect this is rather unlikely situation, though. The last tear
> > > > (without any fallback) should have some memory to reclaim most of
> > > > the time."
> > > >
> > > > Reading the code in try_charge_memcg(), I don't see the last retry of
> > > > try_to_free_mem_cgroup_pages() do anything special. My concern here is
> > > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
> > > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
> > > > report back that it was able to 'reclaim' memory, but the charge keeps
> > > > failing because the memcg reclaim didn't actually make room for the
> > > > charge. What happens in this case? My understanding is that the memcg
> > > > oom-killer gets wrongly invoked.
> > >
> > > The memcg reclaim shrinks from all zones in the allowed zonelist. In
> > > general from all nodes. So unless the lower tier is outside of this
> > > zonelist then there is a zone to reclaim from which cannot demote.
> > > Correct?
> > >
> >
> > Ah, thanks for pointing this out. I did indeed miss that the memcg
> > reclaim tries to apply pressure equally to all the nodes. With some
> > additional testing I'm able to see what you said: there could be no
> > premature oom kill invocation because generally the memcg reclaim will
> > find some pages to reclaim from lower tier nodes.
> >
> > I do find that the first call to try_to_free_mem_cgroup_pages()
> > sometimes will mostly demote and not do much reclaim. I haven't been
> > able to fully track down the cause of that, but I suspect that the
> > first call in my test will find most of the cgroup's memory on top
> > tier nodes. However, we do retry a bunch of times before we invoke oom,
> > and in my testing subsequent calls will find plenty of memory in the
> > lower tier nodes that they can reclaim. I'll update the commit message
> > in the next version.
>
> In the past we used to break out early from the memcg reclaim if the
> reclaim target has been completed - see 1ba6fc9af35b ("mm: vmscan: do
> not share cgroup iteration between reclaimers"). But I do not see an
> early break from the reclaim anywhere anymore, so the precise
> nr_reclaimed tracking shouldn't make a lot of difference. There are
> cases where we really have a hard time finding proper candidates and
> need to dip into higher reclaim priorities though.
>

Yes, I agree. Thank you for your patient explanation. I now see why
precise accounting of nr_reclaimed is not an issue (or at least not as
big an issue as I once thought).

> Anyway, proper nr_reclaimed tracking should be rather straightforward,
> but I do not expect it to make a big difference in practice:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 026199c047e0..1b7f2d8cb128 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	LIST_HEAD(ret_folios);
>  	LIST_HEAD(free_folios);
>  	LIST_HEAD(demote_folios);
> -	unsigned int nr_reclaimed = 0;
> +	unsigned int nr_reclaimed = 0, nr_demoted = 0;
>  	unsigned int pgactivate = 0;
>  	bool do_demote_pass;
>  	struct swap_iocb *plug = NULL;
> @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	}
>  	/* 'folio_list' is always empty here */
>
> -	/* Migrate folios selected for demotion */
> -	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> +	/*
> +	 * Migrate folios selected for demotion.
> +	 * Do not consider demoted pages to be reclaimed for the memcg reclaim
> +	 * because no charges are really freed during the migration. Global
> +	 * reclaim aims at releasing memory from nodes/zones so consider
> +	 * demotion to reclaim memory.
> +	 */
> +	nr_demoted += demote_folio_list(&demote_folios, pgdat);
> +	if (!cgroup_reclaim(sc))
> +		nr_reclaimed += nr_demoted;
> +
>  	/* Folios that could not be demoted are still in @demote_folios */
>  	if (!list_empty(&demote_folios)) {
>  		/* Folios which weren't demoted go back on @folio_list for retry: */
>
> [...]

Thank you again, but this patch breaks the memory.reclaim nodes= arg for
me. This is my test case. I run it on a machine with 2 memory tiers:

Memory tier 1 = nodes 0-2
Memory tier 2 = node 3

mkdir -p /sys/fs/cgroup/unified/test
cd /sys/fs/cgroup/unified/test
echo $$ > cgroup.procs
head -c 500m /dev/random > /tmp/testfile
echo $$ > /sys/fs/cgroup/unified/cgroup.procs
echo "1m nodes=0-2" > memory.reclaim

In my opinion the expected behavior is for the kernel to demote 1MB of
memory from nodes 0-2 to node 3.

Actual behavior on the tip of mm-unstable is as expected.

Actual behavior with your patch cherry-picked to mm-unstable is that the
kernel demotes all 500MB of memory from nodes 0-2 to node 3 and returns
-EAGAIN to the user (a simplified sketch of the retry loop behind that
-EAGAIN is appended at the end of this mail). This may be the correct
behavior you're intending, but it completely breaks the use case I
implemented the nodes= arg for and listed in the commit message of that
change.

> > > Either I am missing something or I simply do not understand why you are
> > > hooked into nodemask so much. Why cannot we have a simple rule that
> > > only global reclaim considers demotions as nr_reclaimed?
> > >
> >
> > Thanks. I think this approach would work for most callers. My issue
> > here is properly supporting the recently added nodes= arg[1] to
> > memory.reclaim. If the user specifies all nodes or provides no arg,
> > I'd like to treat it as memcg reclaim which doesn't count demotions.
> > If the user provides the top tier nodes, I would like to count
> > demotions, as this interface is the way to trigger proactive demotion
> > from top tier nodes.
>
> Sorry to repeat myself but nodemask doesn't really make any difference.
> The reclaim should only count pages which really free memory in the
> domain. Even if you demote to a node outside of the given nodemask then
> the charge stays with the existing memcg so it is rather dubious to
> count it as a reclaimed page. Or do I still miss what you are trying to
> achieve?
>

I think you've convinced me that the issue is not as serious as I first
thought, thanks. We can maybe abandon this fix? If we do want to fix
this now or in the future, if possible let the fix be compatible with
memory.reclaim and its nodes= arg. I'm happy to provide patches for
whatever fix you find acceptable.

> --
> Michal Hocko
> SUSE Labs
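P.S. For reference, here is my understanding of where the -EAGAIN in the
test above comes from: the memory.reclaim write handler keeps calling
try_to_free_mem_cgroup_pages() until it reports enough progress, and with
your patch a nodemask that only covers the top tier (which can only
demote) never reports any progress. The following is only a simplified
sketch paraphrased from memory_reclaim() in mm/memcontrol.c on the
mm-unstable tree I'm testing; the nodes= parsing and the reclaim_options
plumbing are omitted, so exact names and details may differ from the real
code:

static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
			      size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
	unsigned long nr_to_reclaim, nr_reclaimed = 0;
	int err;

	/* "1m nodes=0-2" -> nr_to_reclaim = 256 pages with 4K pages
	 * (nodes= parsing omitted in this sketch). */
	buf = strstrip(buf);
	err = page_counter_memparse(buf, "", &nr_to_reclaim);
	if (err)
		return err;

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		if (signal_pending(current))
			return -EINTR;

		/*
		 * With demotions no longer counted as reclaim for cgroup
		 * reclaim, this returns 0 on every iteration when the
		 * nodemask only contains top tier nodes, even though each
		 * call can still demote a large amount of memory internally.
		 */
		reclaimed = try_to_free_mem_cgroup_pages(memcg,
					nr_to_reclaim - nr_reclaimed,
					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);

		/*
		 * No forward progress for MAX_RECLAIM_RETRIES consecutive
		 * calls, so the write fails with -EAGAIN after having
		 * demoted most of the cgroup's top tier memory.
		 */
		if (!reclaimed && !nr_retries--)
			return -EAGAIN;

		nr_reclaimed += reclaimed;
	}

	return nbytes;
}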