From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Li <chrisl@kernel.org>
Date: Tue, 8 Oct 2024 10:12:36 -0700
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to
 avoid cgroup OOM
To: Jingxiang Zeng
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com,
 linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com,
 weixugc@google.com, yuzhao@google.com
In-Reply-To: <20241008015635.2782751-1-jingxiangzeng.cas@gmail.com>
References: <20241008015635.2782751-1-jingxiangzeng.cas@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

Hi Jingxiang,

I ran the same swap stress test on V4, and it is much better than V3.
The V3 test hung (timed out); V4 no longer hangs and finishes in about
the same time as the baseline.

If we look closer at V4, the numbers seem to suggest its system time is
slightly worse. Is that expected, or might it be noise in my test?
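As a quick back-of-the-envelope check on the samples below (illustrative
arithmetic added for this write-up, not part of the original thread), the
per-run means put the system-time increase at roughly 2.0% and 2.3% for
the two V4 runs:

```python
# Recompute the per-run system-time means from the raw samples posted below.
baseline = [6031.57, 6043.61, 6044.35, 6045.01, 6052.46,
            6053.75, 6057.21, 6063.31, 6075.76, 6123.18]
v4_run1 = [6128.4, 6145.47, 6160.25, 6167.09, 6193.31,
           6195.93, 6197.26, 6202.98, 6204.64, 6213.74]
v4_run2 = [6165.39, 6173.49, 6179.97, 6189.03, 6193.13,
           6199.33, 6204.03, 6212.9, 6216.32, 6256.84]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(baseline), 3))  # matches the 6059.021 summary figure
print(round(mean(v4_run1) / mean(baseline) - 1, 4))  # ~2.0% more system time
print(round(mean(v4_run2) / mean(baseline) - 1, 4))  # ~2.3% more system time
```

The delta is small but consistent across both runs, which fits the
impression that V4 trades a little system time for avoiding the OOM.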
Just trying to understand it better; it is not a NACK by any means.

Here are the numbers on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:

Without (10 times):
user 2688.328
system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75 6057.21 6063.31 6075.76 6123.18
real 277.145

With V4:
First run (10 times):
user 2688.537
system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93 6197.26 6202.98 6204.64 6213.74
real 280.174

Second run (10 times):
user 2771.498
system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33 6204.03 6212.9 6216.32 6256.84
real 284.854

Chris

On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng wrote:
>
> From: Jingxiang Zeng
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when many dirty pages are encountered during MGLRU
> reclamation.
>
> This leads to premature OOM if there are too many dirty pages in the
> cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
>  dump_stack_lvl+0x5f/0x80
>  dump_stack+0x14/0x20
>  dump_header+0x46/0x1b0
>  oom_kill_process+0x104/0x220
>  out_of_memory+0x112/0x5a0
>  mem_cgroup_out_of_memory+0x13b/0x150
>  try_charge_memcg+0x44f/0x5c0
>  charge_memcg+0x34/0x50
>  __mem_cgroup_charge+0x31/0x90
>  filemap_add_folio+0x4b/0xf0
>  __filemap_get_folio+0x1a4/0x5b0
>  ? srso_return_thunk+0x5/0x5f
>  ? __block_commit_write+0x82/0xb0
>  ext4_da_write_begin+0xe5/0x270
>  generic_perform_write+0x134/0x2b0
>  ext4_buffered_write_iter+0x57/0xd0
>  ext4_file_write_iter+0x76/0x7d0
>  ? selinux_file_permission+0x119/0x150
>  ? srso_return_thunk+0x5/0x5f
>  ? srso_return_thunk+0x5/0x5f
>  vfs_write+0x30c/0x440
>  ksys_write+0x65/0xe0
>  __x64_sys_write+0x1e/0x30
>  x64_sys_call+0x11c2/0x1d50
>  do_syscall_64+0x47/0x110
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is about
> to OOM due to dirty caches.
>
> ---
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
>   checking the right flags and rework wake up condition, fixing the
>   performance regression reported by Chris Li.
>   [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
>   case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
>   function. [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
>   function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang
> Signed-off-by: Kairui Song
> Cc: T.J. Mercier
> Cc: Wei Xu
> Cc: Yu Zhao
> ---
>  mm/vmscan.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dc7a285b256b..2a5c2fe81467 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                        int tier_idx)
>  {
>         bool success;
> +       bool dirty, writeback;
>         int gen = folio_lru_gen(folio);
>         int type = folio_is_file_lru(folio);
>         int zone = folio_zonenum(folio);
> @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                 return true;
>         }
>
> +       dirty = folio_test_dirty(folio);
> +       writeback = folio_test_writeback(folio);
> +       if (type == LRU_GEN_FILE && dirty && !writeback)
> +               sc->nr.unqueued_dirty += delta;
> +
>         /* waiting for writeback */
> -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +       if (folio_test_locked(folio) || writeback ||
> +           (type == LRU_GEN_FILE && dirty)) {
>                 gen = folio_inc_gen(lruvec, folio, true);
>                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
>                 return true;
> @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
>                                     scanned, skipped, isolated,
>                                     type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> +       sc->nr.taken += scanned;
>         /*
>          * There might not be eligible folios due to reclaim_idx. Check the
>          * remaining to prevent livelock if it's not making progress.
> @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 cond_resched();
>         }
>
> +       /*
> +        * If too many file cache in the coldest generation can't be evicted
> +        * due to being dirty, wake up the flusher.
> +        */
> +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
>         /* whether this lruvec should be rotated */
>         return nr_to_scan < 0;
> }
> --
> 2.43.5
>
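To make the heuristic above easier to follow, here is a toy userspace
sketch (Python with hypothetical names, not kernel code; it ignores the
kernel's delta weighting, zones, and generation bookkeeping). sort_folio
counts file folios that are dirty but not yet queued for writeback, and
try_to_shrink_lruvec wakes the flusher only when every folio taken in the
scan batch was such an unqueued-dirty folio:

```python
# Simplified model of the v4 wake-up condition. Each "folio" is a dict
# carrying the three flags the patch checks.

def scan_batch(folios):
    """Mimic sort_folio's counting: tally scanned folios, and those that
    are file-backed, dirty, and not yet under writeback."""
    taken = len(folios)
    unqueued_dirty = sum(
        1 for f in folios
        if f["file"] and f["dirty"] and not f["writeback"]
    )
    return taken, unqueued_dirty

def should_wake_flusher(taken, unqueued_dirty):
    """Mimic the try_to_shrink_lruvec check: wake the flusher only when
    the whole batch was dirty file cache with no writeback queued, i.e.
    reclaim cannot make progress without flushing."""
    return unqueued_dirty > 0 and unqueued_dirty == taken

all_dirty = [{"file": True, "dirty": True, "writeback": False}] * 4
mixed = all_dirty + [{"file": True, "dirty": False, "writeback": False}]

print(should_wake_flusher(*scan_batch(all_dirty)))  # True: wake flusher
print(should_wake_flusher(*scan_batch(mixed)))      # False: still progress
```

Requiring the counts to be equal (rather than merely nonzero) is what
keeps the flusher from being woken on every aging cycle, which was the
SSD-wear concern that motivated commit 14aa8b2d5c2e in the first place.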