From mboxrd@z Thu Jan  1 00:00:00 1970
From: Wei Xu <weixugc@google.com>
Date: Thu, 17 Oct 2024 23:36:13 -0700
Subject: Re: [PATCH v5] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: Jingxiang Zeng
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com, linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com, yuzhao@google.com, chrisl@kernel.org
In-Reply-To: <20241018052942.3810740-1-jingxiangzeng.cas@gmail.com>
References: <20241018052942.3810740-1-jingxiangzeng.cas@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Thu, Oct 17, 2024 at 10:29 PM Jingxiang Zeng wrote:
>
> From: Zeng Jingxiang
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can make triggering OOM more likely when
> many dirty pages are encountered during MGLRU reclamation.
>
> This leads to premature OOM if there are too many dirty pages in the
> cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
>  dump_stack_lvl+0x5f/0x80
>  dump_stack+0x14/0x20
>  dump_header+0x46/0x1b0
>  oom_kill_process+0x104/0x220
>  out_of_memory+0x112/0x5a0
>  mem_cgroup_out_of_memory+0x13b/0x150
>  try_charge_memcg+0x44f/0x5c0
>  charge_memcg+0x34/0x50
>  __mem_cgroup_charge+0x31/0x90
>  filemap_add_folio+0x4b/0xf0
>  __filemap_get_folio+0x1a4/0x5b0
>  ? srso_return_thunk+0x5/0x5f
>  ? __block_commit_write+0x82/0xb0
>  ext4_da_write_begin+0xe5/0x270
>  generic_perform_write+0x134/0x2b0
>  ext4_buffered_write_iter+0x57/0xd0
>  ext4_file_write_iter+0x76/0x7d0
>  ? selinux_file_permission+0x119/0x150
>  ? srso_return_thunk+0x5/0x5f
>  ? srso_return_thunk+0x5/0x5f
>  vfs_write+0x30c/0x440
>  ksys_write+0x65/0xe0
>  __x64_sys_write+0x1e/0x30
>  x64_sys_call+0x11c2/0x1d50
>  do_syscall_64+0x47/0x110
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wakeup was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the
> flusher could easily lead to thrashing. So wake it up when a memcg is
> about to OOM due to dirty caches.
>
> I ran the kernel build test[1] on v5, with -j16 and a 1G memcg, on my
> local branch:
>
> Without the patch (10 times):
> user 1473.29
> system 347.87 339.17 345.28 354.64 352.46 355.63 358.80 359.40 358.28
> 350.95 (avg 352.248)
> real 166.651
>
> With the v5 patch (10 times):
> user 1470.7
> system 339.13 350.58 350.07 355.58 348.96 344.83 351.78 336.39 350.45
> 343.31 (avg 347.108)
> real 165.821
>
> The results show about a 1% improvement with this patch, which is
> likely just noise.
>
> ---
> Changes from v4:
> - Add the number of unqueued dirty pages counted in shrink_folio_list
>   to sc->nr.unqueued_dirty. [Wei Xu, Jingxiang Zeng]
> - Reset sc->nr before calling lru_gen_shrink_node.
>   [Wei Xu, Jingxiang Zeng]
> - Modify the conditions for waking up the flusher thread to avoid
>   interference from unevictable and anonymous pages.
>   [Wei Xu, Jingxiang Zeng]
> - Link to v4: https://lore.kernel.org/all/20240929113050.76079-1-jingxiangzeng.cas@gmail.com/
> Changes from v3:
> - Avoid taking the lock and reduce overhead on folio isolation by
>   checking the right flags and reworking the wakeup condition, fixing
>   the performance regression reported by Chris Li.
>   [Chris Li, Kairui Song]
> - Move the wakeup check to try_to_shrink_lruvec to cover the kswapd
>   case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling folio_check_dirty_writeback.
>   [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued dirty folios in the
>   sort_folio function.
>   [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Link: https://lore.kernel.org/all/CACePvbV4L-gRN9UKKuUnksfVJjOTq_5Sti2-e=pb_w51kucLKQ@mail.gmail.com/ [1]
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang
> Signed-off-by: Kairui Song
> Cc: T.J. Mercier
> Cc: Wei Xu
> Cc: Yu Zhao
> ---
>  mm/vmscan.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d0486189804..97e0af338ee0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4292,6 +4292,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                        int tier_idx)
>  {
>         bool success;
> +       bool dirty, writeback;
>         int gen = folio_lru_gen(folio);
>         int type = folio_is_file_lru(folio);
>         int zone = folio_zonenum(folio);
> @@ -4337,9 +4338,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                 return true;
>         }
>
> +       dirty = folio_test_dirty(folio);
> +       writeback = folio_test_writeback(folio);
> +       if (type == LRU_GEN_FILE && dirty && !writeback)
> +               sc->nr.unqueued_dirty += delta;
> +
>         /* waiting for writeback */
> -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +       if (folio_test_locked(folio) || writeback ||
> +           (type == LRU_GEN_FILE && dirty)) {
>                 gen = folio_inc_gen(lruvec, folio, true);
>                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
>                 return true;
> @@ -4455,7 +4461,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
>                                     scanned, skipped, isolated,
>                                     type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> +       if (type == LRU_GEN_FILE)
> +               sc->nr.taken += isolated;
>         /*
>          * There might not be eligible folios due to reclaim_idx. Check the
>          * remaining to prevent livelock if it's not making progress.
> @@ -4589,6 +4596,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>                 return scanned;
>  retry:
>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
> +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>         sc->nr_reclaimed += reclaimed;
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         scanned, reclaimed, &stat, sc->priority,
> @@ -4797,6 +4805,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 cond_resched();
>         }
>
> +       /*
> +        * If too much file cache in the coldest generation can't be
> +        * evicted due to being dirty, wake up the flusher.
> +        */
> +       if (sc->nr.unqueued_dirty && !sc->nr.taken)
> +               wakeup_flusher_threads(WB_REASON_VMSCAN);

This wakeup condition is too relaxed: we can wake up the flusher even
when there is only one unqueued dirty page and many clean file pages.

I'd suggest that we add dirty file pages to sc->nr.taken in
sort_folio() as well, in addition to the isolated file pages counted
in scan_folios(). Then we can change the wakeup condition to:

if (sc->nr.taken && sc->nr.unqueued_dirty == sc->nr.taken)

Also, given that only file pages are counted here, it would be better
to add sc->nr.file_taken as a new counter and use it in this patch for
more clarity.

> +
>         /* whether this lruvec should be rotated */
>         return nr_to_scan < 0;
>  }
> @@ -5942,6 +5957,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         bool reclaimable = false;
>
>         if (lru_gen_enabled() && root_reclaim(sc)) {
> +               memset(&sc->nr, 0, sizeof(sc->nr));
>                 lru_gen_shrink_node(pgdat, sc);
>                 return;
>         }
> --
> 2.43.5
>