From: Yosry Ahmed
Date: Fri, 2 Dec 2022 16:26:05 -0800
Subject: Re: [PATCH v3 1/3] mm: memcg: fix stale protection of reclaim target memcg
To: Andrew Morton, Shakeel Butt, Roman Gushchin, Johannes Weiner, Michal Hocko, Yu Zhao, Muchun Song, Tejun Heo
Cc: "Matthew Wilcox (Oracle)", Vasily Averin, Vlastimil Babka, Chris Down, David Rientjes, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20221202031512.1365483-1-yosryahmed@google.com> <20221202031512.1365483-2-yosryahmed@google.com>
In-Reply-To: <20221202031512.1365483-2-yosryahmed@google.com>

Andrew, does this need to be picked up by stable branches?

On Thu, Dec 1, 2022 at 7:15 PM Yosry Ahmed wrote:
>
> During reclaim, mem_cgroup_calculate_protection() is used to determine
> the effective protection (emin and elow) values of a memcg. The
> protection of the reclaim target is ignored, but we cannot set their
> effective protection to 0 due to a limitation of the current
> implementation (see comment in mem_cgroup_protection()). Instead,
> we leave their effective protection values unchanged, and later ignore it
> in mem_cgroup_protection().
>
> However, mem_cgroup_protection() is called later in
> shrink_lruvec()->get_scan_count(), which is after the
> mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> result, the stale effective protection values of the target memcg may
> lead us to skip reclaiming from the target memcg entirely, before
> calling shrink_lruvec(). This can be even worse with recursive
> protection, where the stale target memcg protection can be higher than
> its standalone protection. See two examples below (a similar version of
> example (a) is added to test_memcontrol in a later patch).
>
> (a) A simple example with proactive reclaim is as follows. Consider the
> following hierarchy:
> ROOT
>  |
>  A
>  |
>  B (memory.min = 10M)
>
> Consider the following scenario:
> - B has memory.current = 10M.
> - The system undergoes global reclaim (or memcg reclaim in A).
> - In shrink_node_memcgs():
>   - mem_cgroup_calculate_protection() calculates the effective min (emin)
>     of B as 10M.
>   - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
> - Now if we want to reclaim 5M from B using proactive reclaim
>   (memory.reclaim), we should be able to, as the protection of the
>   target memcg should be ignored.
> - In shrink_node_memcgs():
>   - mem_cgroup_calculate_protection() immediately returns for B without
>     doing anything, as B is the target memcg, relying on
>     mem_cgroup_protection() to ignore B's stale effective min (still 10M).
>   - mem_cgroup_below_min() reads the stale effective min for B and we
>     skip it instead of ignoring its protection as intended, as we never
>     reach mem_cgroup_protection().
>
> (b) A more complex example with recursive protection is as follows.
> Consider the following hierarchy with memory_recursiveprot:
> ROOT
>  |
>  A (memory.min = 50M)
>  |
>  B (memory.min = 10M, memory.high = 40M)
>
> Consider the following scenario:
> - B has memory.current = 35M.
> - The system undergoes global reclaim (target memcg is NULL).
> - B will have an effective min of 50M (all of A's unclaimed protection).
> - B will not be reclaimed from.
> - Now allocate 10M more memory in B, pushing it above its high limit.
> - The system undergoes memcg reclaim from B (target memcg is B).
> - Like example (a), we do nothing in mem_cgroup_calculate_protection(),
>   then call mem_cgroup_below_min(), which will read the stale effective
>   min for B (50M) and skip it. In this case, it's even worse because we
>   are not just considering B's standalone protection (10M), but we are
>   reading a much higher stale protection (50M) which will cause us to not
>   reclaim from B at all.
>
> This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> e{low,min} state mutations from protection checks") which made
> mem_cgroup_calculate_protection() only change the state without
> returning any value. Before that commit, we used to return
> MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> mem_cgroup_below_{min/low}() checks.
> After that commit we do not return
> anything and we end up checking the min & low effective protections for
> the target memcg, which are stale.
>
> Update mem_cgroup_supports_protection() to also check if we are
> reclaiming from the target, and rename it to mem_cgroup_unprotected()
> (now returns true if we should not protect the memcg, much simpler logic).
>
> Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> Signed-off-by: Yosry Ahmed
> Reviewed-by: Roman Gushchin
> ---
>  include/linux/memcontrol.h | 31 +++++++++++++++++++++----------
>  mm/vmscan.c                | 11 ++++++-----
>  2 files changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..d3c8203cab6c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -615,28 +615,32 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
>  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  				     struct mem_cgroup *memcg);
>
> -static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> +					  struct mem_cgroup *memcg)
>  {
>  	/*
>  	 * The root memcg doesn't account charges, and doesn't support
> -	 * protection.
> +	 * protection. The target memcg's protection is ignored, see
> +	 * mem_cgroup_calculate_protection() and mem_cgroup_protection()
>  	 */
> -	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg);
> -
> +	return mem_cgroup_disabled() || mem_cgroup_is_root(memcg) ||
> +		memcg == target;
>  }
>
> -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> +					struct mem_cgroup *memcg)
>  {
> -	if (!mem_cgroup_supports_protection(memcg))
> +	if (mem_cgroup_unprotected(target, memcg))
>  		return false;
>
>  	return READ_ONCE(memcg->memory.elow) >=
>  		page_counter_read(&memcg->memory);
>  }
>
> -static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> +					struct mem_cgroup *memcg)
>  {
> -	if (!mem_cgroup_supports_protection(memcg))
> +	if (mem_cgroup_unprotected(target, memcg))
>  		return false;
>
>  	return READ_ONCE(memcg->memory.emin) >=
> @@ -1209,12 +1213,19 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  {
>  }
>
> -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> +					  struct mem_cgroup *memcg)
> +{
> +	return true;
> +}
> +static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> +					struct mem_cgroup *memcg)
>  {
>  	return false;
>  }
>
> -static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> +					struct mem_cgroup *memcg)
>  {
>  	return false;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 04d8b88e5216..79ef0fe67518 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4486,7 +4486,7 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned
>
>  	mem_cgroup_calculate_protection(NULL, memcg);
>
> -	if (mem_cgroup_below_min(memcg))
> +	if (mem_cgroup_below_min(NULL, memcg))
>  		return false;
>
>  	need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan);
> @@ -5047,8 +5047,9 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
>  	DEFINE_MAX_SEQ(lruvec);
>  	DEFINE_MIN_SEQ(lruvec);
>
> -	if (mem_cgroup_below_min(memcg) ||
> -	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
> +	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg) ||
> +	    (mem_cgroup_below_low(sc->target_mem_cgroup, memcg) &&
> +	     !sc->memcg_low_reclaim))
>  		return 0;
>
>  	*need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan);
> @@ -6048,13 +6049,13 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
>  		mem_cgroup_calculate_protection(target_memcg, memcg);
>
> -		if (mem_cgroup_below_min(memcg)) {
> +		if (mem_cgroup_below_min(target_memcg, memcg)) {
>  			/*
>  			 * Hard protection.
>  			 * If there is no reclaimable memory, OOM.
>  			 */
>  			continue;
> -		} else if (mem_cgroup_below_low(memcg)) {
> +		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
>  			/*
>  			 * Soft protection.
>  			 * Respect the protection only as long as
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
>
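
For anyone weighing the stable question, the stale-protection problem in examples (a) and (b) of the quoted commit message can be reproduced in a small userspace model. This is only a sketch and not kernel code: struct memcg_model, the model_* helpers and the new_emin parameter are all hypothetical stand-ins invented for illustration, sizes are plain numbers in MiB, and the mem_cgroup_disabled() check is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: only what the model needs. */
struct memcg_model {
	unsigned long usage;	/* models memory.current, in MiB */
	unsigned long emin;	/* cached effective min, in MiB */
	bool is_root;
};

/*
 * Models the early return in mem_cgroup_calculate_protection(): the
 * reclaim target keeps whatever emin an earlier reclaim pass cached.
 * new_emin is an assumption standing in for the real hierarchy walk.
 */
static void model_calculate_protection(const struct memcg_model *target,
				       struct memcg_model *memcg,
				       unsigned long new_emin)
{
	if (memcg == target)
		return;
	memcg->emin = new_emin;
}

/* Pre-patch check: has no notion of the reclaim target. */
static bool model_below_min_old(const struct memcg_model *memcg)
{
	if (memcg->is_root)
		return false;
	return memcg->emin >= memcg->usage;
}

/* Post-patch predicate, following mem_cgroup_unprotected() in the diff. */
static bool model_unprotected(const struct memcg_model *target,
			      const struct memcg_model *memcg)
{
	return memcg->is_root || memcg == target;
}

/* Post-patch check: the target's cached emin is never consulted. */
static bool model_below_min_new(const struct memcg_model *target,
				const struct memcg_model *memcg)
{
	if (model_unprotected(target, memcg))
		return false;
	return memcg->emin >= memcg->usage;
}
```

Walking example (a) through the model: with B at usage 10, a global pass (target NULL) caches emin = 10. A later pass with B as the target leaves emin at 10, so model_below_min_old(&b) still returns true and B is skipped, while model_below_min_new(&b, &b) returns false and B is reclaimed from. Example (b) behaves the same way with a cached emin of 50 against a usage of 45.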