From: jingxiang zeng <jingxiangzeng.cas@gmail.com>
Date: Wed, 9 Oct 2024 13:29:08 +0800
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: Chris Li
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com, linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com, weixugc@google.com, yuzhao@google.com

On Wed, 9 Oct 2024 at 01:12, Chris Li wrote:
>
> Hi Jingxiang,
>
> I did run the same swap stress test on V4 and it is much better than V3.
> The V3 test hung (timed out); V4 does not hang any more and finishes in
> about the same time.
>
> If we look closer at V4, it seems to suggest that V4 system time is
> slightly worse. Is that expected, or might it be noise in my test? Just
> trying to understand it better; it is not a NACK by any means.
>
> Here are the numbers on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> Without (10 times):
> user 2688.328
> system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> 6057.21 6063.31 6075.76 6123.18
> real 277.145
>
> With V4:
> First run (10 times):
> user 2688.537
> system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> 6197.26 6202.98 6204.64 6213.74
> real 280.174
> Second run (10 times):
> user 2771.498
> system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> 6204.03 6212.9 6216.32 6256.84
> real 284.854
>
> Chris

Hi Chris,

Before I released the v4 version, I also ran the swap stress test you
gave me, with -j32 and a 1G memcg, on my local branch:

Without the patch:
1952.07user 1768.35system 4:51.89elapsed 1274%CPU
(0avgtext+0avgdata 920100maxresident)k

With the patch:
1957.83user 1757.06system 4:51.15elapsed 1275%CPU
(0avgtext+0avgdata 919880maxresident)k

My test results are the same as yours, so this should not be test
noise. I am trying to analyze whether it can be further optimized.

Jingxiang Zeng
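For reference, the failure mode described in the quoted patch below can
be reproduced with any dd-style buffered writer running inside a small
memcg. Here is a minimal sketch in C; the file path, write size, total
volume, and the suggested memory limit are illustrative assumptions,
not the exact test:

/* repro.c - dd-like buffered writer: fills a memcg with dirty file
 * pages faster than writeback drains them. Run it in a cgroup with a
 * small memory.max (~300M, matching the OOM report below). This is an
 * illustrative sketch only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)		/* 1 MiB per write() */

int main(void)
{
	char *buf = malloc(CHUNK);
	int fd, i;

	if (!buf)
		return 1;
	memset(buf, 0x5a, CHUNK);

	fd = open("/mnt/test/dirty.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* 4 GiB of buffered writes with no fsync: the pages stay dirty
	 * until the flusher (or reclaim) writes them back. */
	for (i = 0; i < 4096; i++) {
		if (write(fd, buf, CHUNK) != CHUNK) {
			perror("write");
			break;
		}
	}
	close(fd);
	free(buf);
	return 0;
}

Without the flusher wake-up, LRU reclaim in such a cgroup keeps meeting
dirty pages that have no writeback queued, and the cgroup can hit OOM
before writeback catches up.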
> On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng wrote:
> >
> > From: Jingxiang Zeng
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can increase the likelihood of triggering
> > OOM when reclamation encounters many dirty pages on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in the
> > cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> >  dump_stack_lvl+0x5f/0x80
> >  dump_stack+0x14/0x20
> >  dump_header+0x46/0x1b0
> >  oom_kill_process+0x104/0x220
> >  out_of_memory+0x112/0x5a0
> >  mem_cgroup_out_of_memory+0x13b/0x150
> >  try_charge_memcg+0x44f/0x5c0
> >  charge_memcg+0x34/0x50
> >  __mem_cgroup_charge+0x31/0x90
> >  filemap_add_folio+0x4b/0xf0
> >  __filemap_get_folio+0x1a4/0x5b0
> >  ? srso_return_thunk+0x5/0x5f
> >  ? __block_commit_write+0x82/0xb0
> >  ext4_da_write_begin+0xe5/0x270
> >  generic_perform_write+0x134/0x2b0
> >  ext4_buffered_write_iter+0x57/0xd0
> >  ext4_file_write_iter+0x76/0x7d0
> >  ? selinux_file_permission+0x119/0x150
> >  ? srso_return_thunk+0x5/0x5f
> >  ? srso_return_thunk+0x5/0x5f
> >  vfs_write+0x30c/0x440
> >  ksys_write+0x65/0xe0
> >  __x64_sys_write+0x1e/0x30
> >  x64_sys_call+0x11c2/0x1d50
> >  do_syscall_64+0x47/0x110
> >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake up was removed to decrease SSD wearing, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the
> > flusher could easily lead to thrashing. So wake it up when a mem
> > cgroup is about to OOM due to dirty caches.
> >
> > ---
> > Changes from v3:
> > - Avoid taking the lock and reduce overhead on folio isolation by
> >   checking the right flags, and rework the wake up condition, fixing
> >   the performance regression reported by Chris Li.
> >   [Chris Li, Kairui Song]
> > - Move the wake up check to try_to_shrink_lruvec to cover the kswapd
> >   case as well, and update comments. [Kairui Song]
> > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > Changes from v2:
> > - Acquire the lock before calling the folio_check_dirty_writeback
> >   function. [Wei Xu, Jingxiang Zeng]
> > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > Changes from v1:
> > - Add code to count the number of unqueued_dirty in the sort_folio
> >   function. [Wei Xu, Jingxiang Zeng]
> > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > ---
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang
> > Signed-off-by: Kairui Song
> > Cc: T.J. Mercier
> > Cc: Wei Xu
> > Cc: Yu Zhao
> > ---
> >  mm/vmscan.c | 19 ++++++++++++++++---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dc7a285b256b..2a5c2fe81467 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                         int tier_idx)
> >  {
> >         bool success;
> > +       bool dirty, writeback;
> >         int gen = folio_lru_gen(folio);
> >         int type = folio_is_file_lru(folio);
> >         int zone = folio_zonenum(folio);
> > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                 return true;
> >         }
> >
> > +       dirty = folio_test_dirty(folio);
> > +       writeback = folio_test_writeback(folio);
> > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > +               sc->nr.unqueued_dirty += delta;
> > +
> >         /* waiting for writeback */
> > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > +       if (folio_test_locked(folio) || writeback ||
> > +           (type == LRU_GEN_FILE && dirty)) {
> >                 gen = folio_inc_gen(lruvec, folio, true);
> >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> >                 return true;
> > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> >                                     scanned, skipped, isolated,
> >                                     type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > -
> > +       sc->nr.taken += scanned;
> >         /*
> >          * There might not be eligible folios due to reclaim_idx. Check the
> >          * remaining to prevent livelock if it's not making progress.
> > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                 cond_resched();
> >         }
> >
> > +       /*
> > +        * If too much file cache in the coldest generation can't be
> > +        * evicted due to being dirty, wake up the flusher.
> > +        */
> > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> >         /* whether this lruvec should be rotated */
> >         return nr_to_scan < 0;
> >  }
> > --
> > 2.43.5
> >
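To make the new heuristic concrete: sort_folio() counts file folios
that are dirty but not yet queued for writeback, scan_folios() adds
everything it scanned to sc->nr.taken, and try_to_shrink_lruvec() wakes
the flusher only when the two counters match, i.e. every folio the pass
looked at was unqueued dirty. Below is a userspace toy model of that
accounting; the names mirror the patch, but this is an illustration,
not the kernel code:

#include <stdbool.h>
#include <stdio.h>

/* Per-pass counters mirroring sc->nr.taken and sc->nr.unqueued_dirty. */
struct pass_stats {
	unsigned long taken;		/* folios scanned off the coldest gen */
	unsigned long unqueued_dirty;	/* dirty file folios, no writeback queued */
};

struct folio_model {
	bool file;	/* file-backed (vs. anon) */
	bool dirty;
	bool writeback;	/* already queued for / under writeback */
};

static void sort_folio_model(struct pass_stats *st, const struct folio_model *f)
{
	st->taken++;
	/* Dirty file folios with no writeback queued can only become
	 * evictable once the flusher writes them back. */
	if (f->file && f->dirty && !f->writeback)
		st->unqueued_dirty++;
}

static bool should_wake_flusher(const struct pass_stats *st)
{
	/* Wake the flusher only when *every* folio taken in this pass
	 * was unqueued dirty, i.e. eviction made no usable progress. */
	return st->unqueued_dirty && st->unqueued_dirty == st->taken;
}

int main(void)
{
	struct pass_stats st = { 0, 0 };
	struct folio_model tail[3] = {
		{ .file = true, .dirty = true, .writeback = false },
		{ .file = true, .dirty = true, .writeback = false },
		{ .file = true, .dirty = true, .writeback = false },
	};

	for (int i = 0; i < 3; i++)
		sort_folio_model(&st, &tail[i]);

	printf("wake flusher: %s\n", should_wake_flusher(&st) ? "yes" : "no");
	return 0;
}

Because the condition requires the two counts to be equal, a single
clean or writeback-queued folio in the pass suppresses the wake-up,
which preserves the SSD-wear benefit of commit 14aa8b2d5c2e in the
common case.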