From: jingxiang zeng
Date: Wed, 9 Oct 2024 13:34:54 +0800
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: Chris Li
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com, linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com, weixugc@google.com, yuzhao@google.com
References: <20241008015635.2782751-1-jingxiangzeng.cas@gmail.com>

Hi Chris,

Before I released the V4 version, I also ran the swap stress test you gave
me, with -j32, 1G memcg on my local branch:

With the V4 patch:
1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata 920100maxresident)k

Without the patch:
1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata 919880maxresident)k

My test results are the same as yours.
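For reference, the size of the system-time difference between the two runs above can be worked out with a small sketch like the one below. It is only an illustration: `struct time_sample` and `relative_change()` are hypothetical helper names, not part of the patch or of any tool, and the numbers are copied from the two time(1) lines above.

```c
#include <assert.h>

/* Hypothetical record of one time(1) result, all values in seconds. */
struct time_sample {
	double user_s;
	double system_s;
	double elapsed_s;
};

/* Relative change of the patched value against the baseline (0.01 == 1%). */
static double relative_change(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline;
}

/* Copied from the runs above; 4:51.15 and 4:51.89 converted to seconds. */
static const struct time_sample without_patch = { 1957.83, 1757.06, 291.15 };
static const struct time_sample with_v4 = { 1952.07, 1768.35, 291.89 };
```

Calling relative_change(without_patch.system_s, with_v4.system_s) gives roughly 0.6%, that is, a small increase in system time with the V4 patch in this single pair of runs.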
This should not be test noise. I am trying to analyze whether it can be
further optimized.

Jingxiang Zeng

On Wed, 9 Oct 2024 at 01:12, Chris Li wrote:
>
> Hi Jingxiang,
>
> I did run the same swap stress test on V4 and it is much better than V3.
> The V3 test hung there (timed out). V4 does not hang any more; it
> finishes in about the same time.
>
> Looking closer at V4, it seems to suggest that the V4 system time is
> slightly worse. Is that kind of expected, or might it be the noise of my
> test? Just trying to understand it better; it is not a NACK by any means.
>
> Here are the numbers on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> Without (10 times):
> user 2688.328
> system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> 6057.21 6063.31 6075.76 6123.18
> real 277.145
>
> With V4:
> First run (10 times):
> user 2688.537
> system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> 6197.26 6202.98 6204.64 6213.74
> real 280.174
> Second run (10 times):
> user 2771.498
> system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> 6204.03 6212.9 6216.32 6256.84
> real 284.854
>
> Chris
>
> On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng wrote:
> >
> > From: Jingxiang Zeng
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> >
> > dump_stack_lvl+0x5f/0x80
> > dump_stack+0x14/0x20
> > dump_header+0x46/0x1b0
> > oom_kill_process+0x104/0x220
> > out_of_memory+0x112/0x5a0
> > mem_cgroup_out_of_memory+0x13b/0x150
> > try_charge_memcg+0x44f/0x5c0
> > charge_memcg+0x34/0x50
> > __mem_cgroup_charge+0x31/0x90
> > filemap_add_folio+0x4b/0xf0
> > __filemap_get_folio+0x1a4/0x5b0
> > ? srso_return_thunk+0x5/0x5f
> > ? __block_commit_write+0x82/0xb0
> > ext4_da_write_begin+0xe5/0x270
> > generic_perform_write+0x134/0x2b0
> > ext4_buffered_write_iter+0x57/0xd0
> > ext4_file_write_iter+0x76/0x7d0
> > ? selinux_file_permission+0x119/0x150
> > ? srso_return_thunk+0x5/0x5f
> > ? srso_return_thunk+0x5/0x5f
> > vfs_write+0x30c/0x440
> > ksys_write+0x65/0xe0
> > __x64_sys_write+0x1e/0x30
> > x64_sys_call+0x11c2/0x1d50
> > do_syscall_64+0x47/0x110
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake up was removed to decrease SSD wearing, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > could lead to thrashing easily. So wake it up when a mem cgroup is about
> > to OOM due to dirty caches.
> >
> > ---
> > Changes from v3:
> > - Avoid taking lock and reduce overhead on folio isolation by
> >   checking the right flags and rework wake up condition, fixing the
> >   performance regression reported by Chris Li.
> >   [Chris Li, Kairui Song]
> > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> >   case as well, and update comments. [Kairui Song]
> > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > Changes from v2:
> > - Acquire the lock before calling the folio_check_dirty_writeback
> >   function. [Wei Xu, Jingxiang Zeng]
> > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > Changes from v1:
> > - Add code to count the number of unqueued_dirty in the sort_folio
> >   function. [Wei Xu, Jingxiang Zeng]
> > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > ---
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang
> > Signed-off-by: Kairui Song
> > Cc: T.J. Mercier
> > Cc: Wei Xu
> > Cc: Yu Zhao
> > ---
> >  mm/vmscan.c | 19 ++++++++++++++++---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dc7a285b256b..2a5c2fe81467 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                        int tier_idx)
> >  {
> >         bool success;
> > +       bool dirty, writeback;
> >         int gen = folio_lru_gen(folio);
> >         int type = folio_is_file_lru(folio);
> >         int zone = folio_zonenum(folio);
> > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                 return true;
> >         }
> >
> > +       dirty = folio_test_dirty(folio);
> > +       writeback = folio_test_writeback(folio);
> > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > +               sc->nr.unqueued_dirty += delta;
> > +
> >         /* waiting for writeback */
> > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > +       if (folio_test_locked(folio) || writeback ||
> > +           (type == LRU_GEN_FILE && dirty)) {
> >                 gen = folio_inc_gen(lruvec, folio, true);
> >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> >                 return true;
> > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> >                                 scanned, skipped, isolated,
> >                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > -
> > +       sc->nr.taken += scanned;
> >         /*
> >          * There might not be eligible folios due to reclaim_idx. Check the
> >          * remaining to prevent livelock if it's not making progress.
> > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                 cond_resched();
> >         }
> >
> > +       /*
> > +        * If too many file cache in the coldest generation can't be evicted
> > +        * due to being dirty, wake up the flusher.
> > +        */
> > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> >         /* whether this lruvec should be rotated */
> >         return nr_to_scan < 0;
> >  }
> > --
> > 2.43.5
> >
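
To make the wake-up condition in the quoted patch easier to follow, here is a minimal userspace sketch of the accounting. It is a simplified model under stated assumptions, not the kernel code: `struct scan_counters`, `struct folio_state`, `account_folio()` and `should_wake_flusher()` are hypothetical names, and the sketch folds the `sc->nr.taken` update (done in scan_folios() in the patch) and the `sc->nr.unqueued_dirty` update (done in sort_folio()) into a single helper, ignoring the early-return paths of sort_folio().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the counters kept in sc->nr. */
struct scan_counters {
	unsigned long taken;          /* pages scanned in this pass */
	unsigned long unqueued_dirty; /* dirty file pages not yet under writeback */
};

/* Hypothetical, simplified view of one folio seen by the scan. */
struct folio_state {
	bool is_file;           /* type == LRU_GEN_FILE in the patch */
	bool dirty;             /* folio_test_dirty() */
	bool writeback;         /* folio_test_writeback() */
	unsigned long nr_pages; /* "delta" in the patch */
};

/*
 * Simplified accounting: every scanned folio adds to taken, and dirty file
 * folios that are not queued for writeback also add to unqueued_dirty.
 */
static void account_folio(struct scan_counters *c, const struct folio_state *f)
{
	c->taken += f->nr_pages;
	if (f->is_file && f->dirty && !f->writeback)
		c->unqueued_dirty += f->nr_pages;
}

/* Same condition as the check added to try_to_shrink_lruvec(). */
static bool should_wake_flusher(const struct scan_counters *c)
{
	/* In this model unqueued dirty pages are a subset of scanned pages. */
	assert(c->unqueued_dirty <= c->taken);
	return c->unqueued_dirty && c->unqueued_dirty == c->taken;
}
```

In this model the flusher is woken only when every scanned page was an unqueued dirty file page. A single clean folio, or a folio already under writeback, in the same pass makes the two counters differ and suppresses the wake-up, which is how the condition avoids waking flushers on ordinary reclaim.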