From: jingxiang zeng <jingxiangzeng.cas@gmail.com>
Date: Wed, 9 Oct 2024 13:43:16 +0800
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: Chris Li <chrisl@kernel.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com, linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com, weixugc@google.com, yuzhao@google.com
References: <20241008015635.2782751-1-jingxiangzeng.cas@gmail.com>

On Wed, 9 Oct 2024 at 13:29, jingxiang zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Jingxiang,
> >
> > I did run the same swap stress test on V4 and it is much better than V3.
> > The V3 test hung there (timed out); V4 did not hang any more, and it
> > finishes in about the same time.
> >
> > If we look closer at V4, it seems to suggest that the V4 system time is
> > slightly worse. Is that kind of expected, or might it be noise in my test?
> > Just trying to understand it better; it is not a NACK by any means.
> >
> > Here are the numbers on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> > Without (10 times):
> > user    2688.328
> > system  6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> > 6057.21 6063.31 6075.76 6123.18
> > real    277.145
> >
> > With V4:
> > First run (10 times):
> > user    2688.537
> > system  6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> > 6197.26 6202.98 6204.64 6213.74
> > real    280.174
> > Second run (10 times):
> > user    2771.498
> > system  6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> > 6204.03 6212.9 6216.32 6256.84
> > real    284.854
> >
> > Chris
> >
>
> Hi Chris,
>
> Before I released the V4 version, I also ran the swap stress test you gave me,
> with -j32 and a 1G memcg, on my local branch:
>
> Without the patch:
> 1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
> 920100maxresident)k
>
> With the patch:
> 1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
> 919880maxresident)k

The result here is written in reverse; please refer to the other email:
https://lore.kernel.org/all/CAJqJ8ijzAhVxuE16-oawhhs3YmHKWmmoo0ca5KyaLGME7aoXjw@mail.gmail.com/

>
> My test results are the same as yours; this should not be test noise. I am
> trying to analyze whether it can be further optimized.
>
> Jingxiang Zeng
>
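(For scale, the system-time delta Chris quotes above works out to roughly 2%:
(6180.907 - 6059.021) / 6059.021 ~= 0.020 for the first V4 run, and the real-time
delta to about 1.1% (280.174 s vs 277.145 s). The lowest individual V4 sample,
6128.4 s, also sits above the highest baseline sample, 6123.18 s, so the two sets
of runs do not overlap, consistent with the regression being real rather than noise.)
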
> > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> > <jingxiangzeng.cas@gmail.com> wrote:
> > >
> > > From: Jingxiang Zeng <linuszeng@tencent.com>
> > >
> > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > removed the opportunity to wake up flushers during the MGLRU page
> > > reclamation process, which can lead to an increased likelihood of
> > > triggering OOM when encountering many dirty pages during reclamation
> > > on MGLRU.
> > >
> > > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > > Killed
> > >
> > > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > > order=0, oom_score_adj=0
> > >
> > > Call Trace:
> > >   <TASK>
> > >   dump_stack_lvl+0x5f/0x80
> > >   dump_stack+0x14/0x20
> > >   dump_header+0x46/0x1b0
> > >   oom_kill_process+0x104/0x220
> > >   out_of_memory+0x112/0x5a0
> > >   mem_cgroup_out_of_memory+0x13b/0x150
> > >   try_charge_memcg+0x44f/0x5c0
> > >   charge_memcg+0x34/0x50
> > >   __mem_cgroup_charge+0x31/0x90
> > >   filemap_add_folio+0x4b/0xf0
> > >   __filemap_get_folio+0x1a4/0x5b0
> > >   ? srso_return_thunk+0x5/0x5f
> > >   ? __block_commit_write+0x82/0xb0
> > >   ext4_da_write_begin+0xe5/0x270
> > >   generic_perform_write+0x134/0x2b0
> > >   ext4_buffered_write_iter+0x57/0xd0
> > >   ext4_file_write_iter+0x76/0x7d0
> > >   ? selinux_file_permission+0x119/0x150
> > >   ? srso_return_thunk+0x5/0x5f
> > >   ? srso_return_thunk+0x5/0x5f
> > >   vfs_write+0x30c/0x440
> > >   ksys_write+0x65/0xe0
> > >   __x64_sys_write+0x1e/0x30
> > >   x64_sys_call+0x11c2/0x1d50
> > >   do_syscall_64+0x47/0x110
> > >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > >  memory: usage 308224kB, limit 308224kB, failcnt 2589
> > >  swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > >
> > >   ...
> > >   file_dirty 303247360
> > >   file_writeback 0
> > >   ...
> > >
> > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > > oom_score_adj:0
> > >
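(Sanity-checking those counters: file_dirty in the dump is in bytes, unlike the
usage/limit lines, which are in kB. 303247360 bytes is about 296140 kB, roughly
96% of the 308224 kB limit, while file_writeback is 0, so nearly the entire cgroup
is dirty page cache with no writeback in flight. That is why reclaim cannot make
progress and the charge ends in the OOM kill shown above.)
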
> > > The flusher wake-up was removed to decrease SSD wearing, but if we are
> > > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > > could lead to thrashing easily.  So wake it up when a mem cgroup is about
> > > to OOM due to dirty caches.
> > >
> > > ---
> > > Changes from v3:
> > > - Avoid taking the lock and reduce overhead on folio isolation by
> > >   checking the right flags and reworking the wake-up condition, fixing the
> > >   performance regression reported by Chris Li.
> > >   [Chris Li, Kairui Song]
> > > - Move the wake-up check to try_to_shrink_lruvec to cover the kswapd
> > >   case as well, and update comments. [Kairui Song]
> > > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v2:
> > > - Acquire the lock before calling the folio_check_dirty_writeback
> > >   function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v1:
> > > - Add code to count the number of unqueued_dirty folios in the sort_folio
> > >   function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > > ---
> > >
> > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > Cc: T.J. Mercier <tjmercier@google.com>
> > > Cc: Wei Xu <weixugc@google.com>
> > > Cc: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  mm/vmscan.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index dc7a285b256b..2a5c2fe81467 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                         int tier_idx)
> > >  {
> > >         bool success;
> > > +       bool dirty, writeback;
> > >         int gen = folio_lru_gen(folio);
> > >         int type = folio_is_file_lru(folio);
> > >         int zone = folio_zonenum(folio);
> > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                 return true;
> > >         }
> > >
> > > +       dirty = folio_test_dirty(folio);
> > > +       writeback = folio_test_writeback(folio);
> > > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > > +               sc->nr.unqueued_dirty += delta;
> > > +
> > >         /* waiting for writeback */
> > > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > +       if (folio_test_locked(folio) || writeback ||
> > > +           (type == LRU_GEN_FILE && dirty)) {
> > >                 gen = folio_inc_gen(lruvec, folio, true);
> > >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > >                 return true;
> > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > >                                     scanned, skipped, isolated,
> > >                                     type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > > -
> > > +       sc->nr.taken += scanned;
> > >         /*
> > >          * There might not be eligible folios due to reclaim_idx. Check the
> > >          * remaining to prevent livelock if it's not making progress.
> > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >                 cond_resched();
> > >         }
> > >
> > > +       /*
> > > +        * If too many file cache in the coldest generation can't be evicted
> > > +        * due to being dirty, wake up the flusher.
> > > +        */
> > > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +
> > >         /* whether this lruvec should be rotated */
> > >         return nr_to_scan < 0;
> > >  }
> > > --
> > > 2.43.5
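
(If it helps to see the new condition in isolation, below is a minimal userspace
sketch of the heuristic the diff adds. The struct folio, scan_stats, scan() and
wake_flusher() names are made up for illustration and are not kernel APIs, and
per-folio page counts are simplified to 1; only the final if-condition mirrors
the patch.)

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins, not kernel types. */
struct folio { bool file; bool dirty; bool writeback; };
struct scan_stats { unsigned long taken; unsigned long unqueued_dirty; };

static void wake_flusher(void)
{
        /* Stands in for wakeup_flusher_threads(WB_REASON_VMSCAN). */
        puts("wakeup_flusher_threads(WB_REASON_VMSCAN)");
}

/* Mimics the bookkeeping added to sort_folio()/scan_folios(): count every
 * scanned folio, and separately count file folios that are dirty but have
 * no writeback queued yet. */
static void scan(const struct folio *folios, int n, struct scan_stats *st)
{
        for (int i = 0; i < n; i++) {
                st->taken++;
                if (folios[i].file && folios[i].dirty && !folios[i].writeback)
                        st->unqueued_dirty++;
        }
}

int main(void)
{
        /* Tail of the coldest generation: all dirty file cache, nothing
         * queued for writeback -- the situation that used to end in OOM. */
        struct folio tail[4] = {
                { .file = true, .dirty = true }, { .file = true, .dirty = true },
                { .file = true, .dirty = true }, { .file = true, .dirty = true },
        };
        struct scan_stats st = { 0 };

        scan(tail, 4, &st);

        /* Same condition as the hunk in try_to_shrink_lruvec(). */
        if (st.unqueued_dirty && st.unqueued_dirty == st.taken)
                wake_flusher();

        return 0;
}

(Here every scanned folio is unqueued dirty, so the flusher is woken; mark any
one of them clean or already under writeback and the equality no longer holds,
which is what keeps the wake-up from firing on ordinary reclaim passes.)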