From: jingxiang zeng <jingxiangzeng.cas@gmail.com>
Date: Wed, 9 Oct 2024 13:43:16 +0800
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: Chris Li <chrisl@kernel.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com, linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com, weixugc@google.com, yuzhao@google.com
References: <20241008015635.2782751-1-jingxiangzeng.cas@gmail.com>

On Wed, 9 Oct 2024 at 13:29, jingxiang zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Jingxiang,
> >
> > I did run the same swap stress test on V4 and it is much better than V3.
> > The V3 test hung there (timed out); V4 did not hang any more, and it
> > finishes in about the same time.
> >
> > If we look closer at V4, it seems to suggest that the V4 system time is
> > slightly worse. Is that kind of expected, or might it be noise in my test?
> > Just trying to understand it better; it is not a NACK by any means.
> >
> > Here are the numbers on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> > Without (10 times):
> > user    2688.328
> > system  6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> > 6057.21 6063.31 6075.76 6123.18
> > real    277.145
> >
> > With V4:
> > First run (10 times):
> > user    2688.537
> > system  6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> > 6197.26 6202.98 6204.64 6213.74
> > real    280.174
> > Second run (10 times):
> > user    2771.498
> > system  6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> > 6204.03 6212.9 6216.32 6256.84
> > real    284.854
> >
> > Chris
> >
>
> Hi Chris,
>
> Before I released the V4 version, I also ran the swap stress test you gave me,
> with -j32 and a 1G memcg, on my local branch:
>
> Without the patch:
> 1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
> 920100maxresident)k
>
> With the patch:
> 1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
> 919880maxresident)k

The result here is written in reverse; please refer to the other email:
https://lore.kernel.org/all/CAJqJ8ijzAhVxuE16-oawhhs3YmHKWmmoo0ca5KyaLGME7aoXjw@mail.gmail.com/

>
> My test results are the same as yours; this should not be test noise. I am
> trying to analyze whether it can be further optimized.
>
> Jingxiang Zeng
>
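(For scale, the system-time delta Chris quotes above works out to roughly 2%:
(6180.907 - 6059.021) / 6059.021 ~= 0.020 for the first V4 run, and the real-time
delta to about 1.1% (280.174 s vs 277.145 s). The lowest individual V4 sample,
6128.4 s, also sits above the highest baseline sample, 6123.18 s, so the two sets
of runs do not overlap, consistent with the regression being real rather than noise.)
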
> > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> > <jingxiangzeng.cas@gmail.com> wrote:
> > >
> > > From: Jingxiang Zeng <linuszeng@tencent.com>
> > >
> > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > removed the opportunity to wake up flushers during the MGLRU page
> > > reclamation process, which can lead to an increased likelihood of
> > > triggering OOM when encountering many dirty pages during reclamation
> > > on MGLRU.
> > >
> > > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > > Killed
> > >
> > > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > > order=0, oom_score_adj=0
> > >
> > > Call Trace:
> > >   <TASK>
> > >   dump_stack_lvl+0x5f/0x80
> > >   dump_stack+0x14/0x20
> > >   dump_header+0x46/0x1b0
> > >   oom_kill_process+0x104/0x220
> > >   out_of_memory+0x112/0x5a0
> > >   mem_cgroup_out_of_memory+0x13b/0x150
> > >   try_charge_memcg+0x44f/0x5c0
> > >   charge_memcg+0x34/0x50
> > >   __mem_cgroup_charge+0x31/0x90
> > >   filemap_add_folio+0x4b/0xf0
> > >   __filemap_get_folio+0x1a4/0x5b0
> > >   ? srso_return_thunk+0x5/0x5f
> > >   ? __block_commit_write+0x82/0xb0
> > >   ext4_da_write_begin+0xe5/0x270
> > >   generic_perform_write+0x134/0x2b0
> > >   ext4_buffered_write_iter+0x57/0xd0
> > >   ext4_file_write_iter+0x76/0x7d0
> > >   ? selinux_file_permission+0x119/0x150
> > >   ? srso_return_thunk+0x5/0x5f
> > >   ? srso_return_thunk+0x5/0x5f
> > >   vfs_write+0x30c/0x440
> > >   ksys_write+0x65/0xe0
> > >   __x64_sys_write+0x1e/0x30
> > >   x64_sys_call+0x11c2/0x1d50
> > >   do_syscall_64+0x47/0x110
> > >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > >  memory: usage 308224kB, limit 308224kB, failcnt 2589
> > >  swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > >
> > >   ...
> > >   file_dirty 303247360
> > >   file_writeback 0
> > >   ...
> > >
> > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > > oom_score_adj:0
> > >
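(Sanity-checking those counters: file_dirty in the dump is in bytes, unlike the
usage/limit lines, which are in kB. 303247360 bytes is about 296140 kB, roughly
96% of the 308224 kB limit, while file_writeback is 0, so nearly the entire cgroup
is dirty page cache with no writeback in flight. That is why reclaim cannot make
progress and the charge ends in the OOM kill shown above.)
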
> > > The flusher wake-up was removed to decrease SSD wearing, but if we are
> > > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > > could lead to thrashing easily.  So wake it up when a mem cgroup is about
> > > to OOM due to dirty caches.
> > >
> > > ---
> > > Changes from v3:
> > > - Avoid taking the lock and reduce overhead on folio isolation by
> > >   checking the right flags and reworking the wake-up condition, fixing the
> > >   performance regression reported by Chris Li.
> > >   [Chris Li, Kairui Song]
> > > - Move the wake-up check to try_to_shrink_lruvec to cover the kswapd
> > >   case as well, and update comments. [Kairui Song]
> > > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v2:
> > > - Acquire the lock before calling the folio_check_dirty_writeback
> > >   function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v1:
> > > - Add code to count the number of unqueued_dirty folios in the sort_folio
> > >   function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > > ---
> > >
> > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > Cc: T.J. Mercier <tjmercier@google.com>
> > > Cc: Wei Xu <weixugc@google.com>
> > > Cc: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  mm/vmscan.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index dc7a285b256b..2a5c2fe81467 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                         int tier_idx)
> > >  {
> > >         bool success;
> > > +       bool dirty, writeback;
> > >         int gen = folio_lru_gen(folio);
> > >         int type = folio_is_file_lru(folio);
> > >         int zone = folio_zonenum(folio);
> > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                 return true;
> > >         }
> > >
> > > +       dirty = folio_test_dirty(folio);
> > > +       writeback = folio_test_writeback(folio);
> > > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > > +               sc->nr.unqueued_dirty += delta;
> > > +
> > >         /* waiting for writeback */
> > > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > +       if (folio_test_locked(folio) || writeback ||
> > > +           (type == LRU_GEN_FILE && dirty)) {
> > >                 gen = folio_inc_gen(lruvec, folio, true);
> > >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > >                 return true;
> > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > >                                     scanned, skipped, isolated,
> > >                                     type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > > -
> > > +       sc->nr.taken += scanned;
> > >         /*
> > >          * There might not be eligible folios due to reclaim_idx. Check the
> > >          * remaining to prevent livelock if it's not making progress.
> > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >                 cond_resched();
> > >         }
> > >
> > > +       /*
> > > +        * If too many file cache in the coldest generation can't be evicted
> > > +        * due to being dirty, wake up the flusher.
> > > +        */
> > > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +
> > >         /* whether this lruvec should be rotated */
> > >         return nr_to_scan < 0;
> > >  }
> > > --
> > > 2.43.5
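
(If it helps to see the new condition in isolation, below is a minimal userspace
sketch of the heuristic the diff adds. The struct folio, scan_stats, scan() and
wake_flusher() names are made up for illustration and are not kernel APIs, and
per-folio page counts are simplified to 1; only the final if-condition mirrors
the patch.)

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins, not kernel types. */
struct folio { bool file; bool dirty; bool writeback; };
struct scan_stats { unsigned long taken; unsigned long unqueued_dirty; };

static void wake_flusher(void)
{
        /* Stands in for wakeup_flusher_threads(WB_REASON_VMSCAN). */
        puts("wakeup_flusher_threads(WB_REASON_VMSCAN)");
}

/* Mimics the bookkeeping added to sort_folio()/scan_folios(): count every
 * scanned folio, and separately count file folios that are dirty but have
 * no writeback queued yet. */
static void scan(const struct folio *folios, int n, struct scan_stats *st)
{
        for (int i = 0; i < n; i++) {
                st->taken++;
                if (folios[i].file && folios[i].dirty && !folios[i].writeback)
                        st->unqueued_dirty++;
        }
}

int main(void)
{
        /* Tail of the coldest generation: all dirty file cache, nothing
         * queued for writeback -- the situation that used to end in OOM. */
        struct folio tail[4] = {
                { .file = true, .dirty = true }, { .file = true, .dirty = true },
                { .file = true, .dirty = true }, { .file = true, .dirty = true },
        };
        struct scan_stats st = { 0 };

        scan(tail, 4, &st);

        /* Same condition as the hunk in try_to_shrink_lruvec(). */
        if (st.unqueued_dirty && st.unqueued_dirty == st.taken)
                wake_flusher();

        return 0;
}

(Here every scanned folio is unqueued dirty, so the flusher is woken; mark any
one of them clean or already under writeback and the equality no longer holds,
which is what keeps the wake-up from firing on ordinary reclaim passes.)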