From: Yuanchu Xie <yuanchu@google.com>
Date: Wed, 28 May 2025 13:53:19 -0700
Subject: Re: [PATCH] mm: vmscan: apply proportional reclaim pressure for memcg when MGLRU is enabled
To: Koichiro Den <koichiro.den@canonical.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, yuzhao@google.com, linux-kernel@vger.kernel.org
In-Reply-To: <20250404141118.3895592-1-koichiro.den@canonical.com>
On Fri, Apr 4, 2025 at 7:11 AM Koichiro Den <koichiro.den@canonical.com> wrote:
>
> The scan implementation for MGLRU was missing proportional reclaim
> pressure for memcg, which contradicts the description in
> Documentation/admin-guide/cgroup-v2.rst (memory.{low,min} section).

Nice, this is a discrepancy between the two reclaim implementations.
Thanks for addressing it.

>
> This issue was revealed by the LTP memcontrol03 [1] test case. The
> following example output, from a local test environment with no NUMA,
> shows that prior to this patch, proportional protection was not
> working:
>
> * Without this patch (MGLRU enabled):
> $ sudo LTP_SINGLE_FS_TYPE=xfs ./memcontrol03
> ...
> memcontrol03.c:214: TPASS: Expect: (A/B/C memory.current=25964544) ~= 34603008
> memcontrol03.c:216: TPASS: Expect: (A/B/D memory.current=26038272) ~= 17825792
> ...
>
> * With this patch (MGLRU enabled):
> $ sudo LTP_SINGLE_FS_TYPE=xfs ./memcontrol03
> ...
> memcontrol03.c:214: TPASS: Expect: (A/B/C memory.current=29327360) ~= 34603008
> memcontrol03.c:216: TPASS: Expect: (A/B/D memory.current=23748608) ~= 17825792
> ...
>
> * When MGLRU is disabled:
> $ sudo LTP_SINGLE_FS_TYPE=xfs ./memcontrol03
> ...
> memcontrol03.c:214: TPASS: Expect: (A/B/C memory.current=28819456) ~= 34603008
> memcontrol03.c:216: TPASS: Expect: (A/B/D memory.current=24018944) ~= 17825792
> ...
>
> Note that the test shows TPASS in all cases here due to its lenient
> criteria. Even with this patch, or with MGLRU disabled, the results
> above deviate slightly from the expected values, but that is because
> the memory usage involved is small relative to the >> DEF_PRIORITY
> adjustment.

It's disappointing that the LTP test doesn't fail when proportional
reclaim pressure isn't applied. Would you be interested in tightening
the test as well?
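As a back-of-the-envelope illustration of the proportioning the patch
factors out below (all numbers are made up for the example; they are
not memcontrol03's actual configuration), a cgroup using 100 MiB with
75 MiB effectively protected ends up with roughly the unprotected
quarter of its scan target:

#include <stdio.h>

int main(void)
{
	unsigned long cgroup_size = 25600; /* usage: 100 MiB in 4 KiB pages */
	unsigned long protection  = 19200; /* effective low/min: 75 MiB */
	unsigned long scan        = 25600; /* pages the caller would scan */

	/* the patch's max(), which keeps the subtraction from wrapping */
	if (cgroup_size < protection)
		cgroup_size = protection;

	/* same expression as in apply_proportional_protection() below */
	scan -= scan * protection / (cgroup_size + 1);

	/* prints "scan target: 6401 pages (~25% of usage)" */
	printf("scan target: %lu pages (~%lu%% of usage)\n",
	       scan, scan * 100 / 25600);
	return 0;
}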
>
> Factor out the proportioning logic to a new function and have MGLRU
> reuse it.
>
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/memcontrol03.c
>
> Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
> ---
>  mm/vmscan.c | 148 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 78 insertions(+), 70 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b620d74b0f66..c594d8264938 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2467,6 +2467,69 @@ static inline void calculate_pressure_balance(struct scan_control *sc,
>  	*denominator = ap + fp;
>  }
>
> +static unsigned long apply_proportional_protection(struct mem_cgroup *memcg,
> +					struct scan_control *sc, unsigned long scan)
> +{
> +	unsigned long min, low;
> +
> +	mem_cgroup_protection(sc->target_mem_cgroup, memcg, &min, &low);
> +
> +	if (min || low) {
> +		/*
> +		 * Scale a cgroup's reclaim pressure by proportioning
> +		 * its current usage to its memory.low or memory.min
> +		 * setting.
> +		 *
> +		 * This is important, as otherwise scanning aggression
> +		 * becomes extremely binary -- from nothing as we
> +		 * approach the memory protection threshold, to totally
> +		 * nominal as we exceed it. This results in requiring
> +		 * setting extremely liberal protection thresholds. It
> +		 * also means we simply get no protection at all if we
> +		 * set it too low, which is not ideal.
> +		 *
> +		 * If there is any protection in place, we reduce scan
> +		 * pressure by how much of the total memory used is
> +		 * within protection thresholds.
> +		 *
> +		 * There is one special case: in the first reclaim pass,
> +		 * we skip over all groups that are within their low
> +		 * protection. If that fails to reclaim enough pages to
> +		 * satisfy the reclaim goal, we come back and override
> +		 * the best-effort low protection. However, we still
> +		 * ideally want to honor how well-behaved groups are in
> +		 * that case instead of simply punishing them all
> +		 * equally. As such, we reclaim them based on how much
> +		 * memory they are using, reducing the scan pressure
> +		 * again by how much of the total memory used is under
> +		 * hard protection.
> +		 */
> +		unsigned long cgroup_size = mem_cgroup_size(memcg);
> +		unsigned long protection;
> +
> +		/* memory.low scaling, make sure we retry before OOM */
> +		if (!sc->memcg_low_reclaim && low > min) {
> +			protection = low;
> +			sc->memcg_low_skipped = 1;
> +		} else {
> +			protection = min;
> +		}
> +
> +		/* Avoid TOCTOU with earlier protection check */
> +		cgroup_size = max(cgroup_size, protection);
> +
> +		scan -= scan * protection / (cgroup_size + 1);
> +
> +		/*
> +		 * Minimally target SWAP_CLUSTER_MAX pages to keep
> +		 * reclaim moving forwards, avoiding decrementing
> +		 * sc->priority further than desirable.
> +		 */
> +		scan = max(scan, SWAP_CLUSTER_MAX);
> +	}
> +	return scan;
> +}
> +
>  /*
>   * Determine how aggressively the anon and file LRU lists should be
>   * scanned.
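An editorial aside, not part of the patch: the "Avoid TOCTOU" max()
above is also what keeps the unsigned subtraction from wrapping; once
cgroup_size >= protection, the subtrahend is strictly less than scan.
A quick brute-force check of that bound:

#include <assert.h>

int main(void)
{
	/* worst case after max(): cgroup_size == protection */
	for (unsigned long scan = 1; scan <= 1000; scan++)
		for (unsigned long protection = 0; protection <= 1000; protection++)
			assert(scan * protection / (protection + 1) < scan);
	return 0;
}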
> @@ -2537,70 +2600,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	for_each_evictable_lru(lru) {
>  		bool file = is_file_lru(lru);
>  		unsigned long lruvec_size;
> -		unsigned long low, min;
>  		unsigned long scan;
>
>  		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> -		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> -				      &min, &low);
> -
> -		if (min || low) {
> -			/*
> -			 * Scale a cgroup's reclaim pressure by proportioning
> -			 * its current usage to its memory.low or memory.min
> -			 * setting.
> -			 *
> -			 * This is important, as otherwise scanning aggression
> -			 * becomes extremely binary -- from nothing as we
> -			 * approach the memory protection threshold, to totally
> -			 * nominal as we exceed it. This results in requiring
> -			 * setting extremely liberal protection thresholds. It
> -			 * also means we simply get no protection at all if we
> -			 * set it too low, which is not ideal.
> -			 *
> -			 * If there is any protection in place, we reduce scan
> -			 * pressure by how much of the total memory used is
> -			 * within protection thresholds.
> -			 *
> -			 * There is one special case: in the first reclaim pass,
> -			 * we skip over all groups that are within their low
> -			 * protection. If that fails to reclaim enough pages to
> -			 * satisfy the reclaim goal, we come back and override
> -			 * the best-effort low protection. However, we still
> -			 * ideally want to honor how well-behaved groups are in
> -			 * that case instead of simply punishing them all
> -			 * equally. As such, we reclaim them based on how much
> -			 * memory they are using, reducing the scan pressure
> -			 * again by how much of the total memory used is under
> -			 * hard protection.
> -			 */
> -			unsigned long cgroup_size = mem_cgroup_size(memcg);
> -			unsigned long protection;
> -
> -			/* memory.low scaling, make sure we retry before OOM */
> -			if (!sc->memcg_low_reclaim && low > min) {
> -				protection = low;
> -				sc->memcg_low_skipped = 1;
> -			} else {
> -				protection = min;
> -			}
> -
> -			/* Avoid TOCTOU with earlier protection check */
> -			cgroup_size = max(cgroup_size, protection);
> -
> -			scan = lruvec_size - lruvec_size * protection /
> -				(cgroup_size + 1);
> -
> -			/*
> -			 * Minimally target SWAP_CLUSTER_MAX pages to keep
> -			 * reclaim moving forwards, avoiding decrementing
> -			 * sc->priority further than desirable.
> -			 */
> -			scan = max(scan, SWAP_CLUSTER_MAX);
> -		} else {
> -			scan = lruvec_size;
> -		}
> -
> +		scan = apply_proportional_protection(memcg, sc, lruvec_size);
>  		scan >>= sc->priority;
>
>  		/*
> @@ -4521,8 +4524,9 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>  	return true;
>  }
>
> -static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> -		       int type, int tier, struct list_head *list)
> +static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> +		       struct scan_control *sc, int type, int tier,
> +		       struct list_head *list)
>  {
>  	int i;
>  	int gen;
> @@ -4531,7 +4535,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>  	int scanned = 0;
>  	int isolated = 0;
>  	int skipped = 0;
> -	int remaining = MAX_LRU_BATCH;
> +	int remaining = min(nr_to_scan, MAX_LRU_BATCH);
>  	struct lru_gen_folio *lrugen = &lruvec->lrugen;
>  	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> @@ -4642,7 +4646,8 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
>  	return positive_ctrl_err(&sp, &pv);
>  }
>
> -static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
> +static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> +			  struct scan_control *sc, int swappiness,
>  			  int *type_scanned, struct list_head *list)
>  {
>  	int i;
> @@ -4654,7 +4659,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>
>  		*type_scanned = type;
>
> -		scanned = scan_folios(lruvec, sc, type, tier, list);
> +		scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
>  		if (scanned)
>  			return scanned;
>
> @@ -4664,7 +4669,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>  	return 0;
>  }
>
> -static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> +static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> +			struct scan_control *sc, int swappiness)
>  {
>  	int type;
>  	int scanned;
> @@ -4683,7 +4689,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>
>  	spin_lock_irq(&lruvec->lru_lock);
>
> -	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
> +	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
>
>  	scanned += try_to_inc_min_seq(lruvec, swappiness);
>
> @@ -4804,6 +4810,8 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>  	if (nr_to_scan && !mem_cgroup_online(memcg))
>  		return nr_to_scan;
>
> +	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> +
>  	/* try to get away with not aging at the default priority */
>  	if (!success || sc->priority == DEF_PRIORITY)
>  		return nr_to_scan >> sc->priority;
> @@ -4856,7 +4864,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  		if (nr_to_scan <= 0)
>  			break;
>
> -		delta = evict_folios(lruvec, sc, swappiness);
> +		delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
>  		if (!delta)
>  			break;
>
> @@ -5477,7 +5485,7 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
>  		if (sc->nr_reclaimed >= nr_to_reclaim)
>  			return 0;
>
> -		if (!evict_folios(lruvec, sc, swappiness))
> +		if (!evict_folios(MAX_LRU_BATCH, lruvec, sc, swappiness))
>  			return 0;

Right now this change preserves the current behavior, but since this
path is only invoked from the debugfs interface, it would also be
reasonable to pass something like nr_to_reclaim - sc->nr_reclaimed, so
that run_eviction() evicts closer to the nr_to_reclaim pages it
advertises, at the cost of diverging from the current behavior. I have
no strong opinion here, so if you're a user of this proactive reclaim
interface and would prefer that change, go ahead.
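To make that suggestion concrete, the change I have in mind is roughly
the following (an untested sketch; it is safe only because the
sc->nr_reclaimed >= nr_to_reclaim check above has already returned, so
the subtraction cannot go negative):

		if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed,
				  lruvec, sc, swappiness))
			return 0;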
>
>  		cond_resched();
> --
> 2.45.2
>

Reviewed-by: Yuanchu Xie <yuanchu@google.com>