From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wei Xu <weixugc@google.com>
Date: Tue, 8 Oct 2024 23:21:17 -0700
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: jingxiang zeng <jingxiangzeng.cas@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com,
    linuszeng@tencent.com, linux-kernel@vger.kernel.org, tjmercier@google.com,
    yuzhao@google.com, chrisl@kernel.org
References: <20241008015635.2782751-1-jingxiangzeng.cas@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, Oct 8, 2024 at 9:52 PM jingxiang zeng wrote:
>
> On Tue, 8 Oct 2024 at 11:26, Wei Xu wrote:
> >
> > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng wrote:
> > >
> > > From: Jingxiang Zeng
> > >
> > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > removed the opportunity to wake up flushers during the MGLRU page
> > > reclamation process, which can lead to an increased likelihood of
> > > triggering OOM when encountering many dirty pages during reclamation on MGLRU.
> > >
> > > This leads to premature OOM if there are too many dirty pages in cgroup:
> > > Killed
> > >
> > > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > > order=0, oom_score_adj=0
> > >
> > > Call Trace:
> > >
> > > dump_stack_lvl+0x5f/0x80
> > > dump_stack+0x14/0x20
> > > dump_header+0x46/0x1b0
> > > oom_kill_process+0x104/0x220
> > > out_of_memory+0x112/0x5a0
> > > mem_cgroup_out_of_memory+0x13b/0x150
> > > try_charge_memcg+0x44f/0x5c0
> > > charge_memcg+0x34/0x50
> > > __mem_cgroup_charge+0x31/0x90
> > > filemap_add_folio+0x4b/0xf0
> > > __filemap_get_folio+0x1a4/0x5b0
> > > ? srso_return_thunk+0x5/0x5f
> > > ? __block_commit_write+0x82/0xb0
> > > ext4_da_write_begin+0xe5/0x270
> > > generic_perform_write+0x134/0x2b0
> > > ext4_buffered_write_iter+0x57/0xd0
> > > ext4_file_write_iter+0x76/0x7d0
> > > ? selinux_file_permission+0x119/0x150
> > > ? srso_return_thunk+0x5/0x5f
> > > ? srso_return_thunk+0x5/0x5f
> > > vfs_write+0x30c/0x440
> > > ksys_write+0x65/0xe0
> > > __x64_sys_write+0x1e/0x30
> > > x64_sys_call+0x11c2/0x1d50
> > > do_syscall_64+0x47/0x110
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > >
> > > ...
> > > file_dirty 303247360
> > > file_writeback 0
> > > ...
> > >
> > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > > oom_score_adj:0
> > >
> > > The flusher wake-up was removed to decrease SSD wearing, but if we are
> > > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > > could lead to thrashing easily. So wake it up when a mem cgroup is about
> > > to OOM due to dirty caches.
> > >
> > > ---
> > > Changes from v3:
> > > - Avoid taking lock and reduce overhead on folio isolation by
> > >   checking the right flags and rework wake up condition, fixing the
> > >   performance regression reported by Chris Li.
> > >   [Chris Li, Kairui Song]
> > > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > >   case as well, and update comments. [Kairui Song]
> > > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v2:
> > > - Acquire the lock before calling the folio_check_dirty_writeback
> > >   function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v1:
> > > - Add code to count the number of unqueued_dirty in the sort_folio
> > >   function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > > ---
> > >
> > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > Signed-off-by: Zeng Jingxiang
> > > Signed-off-by: Kairui Song
> > > Cc: T.J. Mercier
> > > Cc: Wei Xu
> > > Cc: Yu Zhao
> > > ---
> > >  mm/vmscan.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index dc7a285b256b..2a5c2fe81467 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                        int tier_idx)
> > >  {
> > >         bool success;
> > > +       bool dirty, writeback;
> > >         int gen = folio_lru_gen(folio);
> > >         int type = folio_is_file_lru(folio);
> > >         int zone = folio_zonenum(folio);
> > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                 return true;
> > >         }
> > >
> > > +       dirty = folio_test_dirty(folio);
> > > +       writeback = folio_test_writeback(folio);
> > > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > > +               sc->nr.unqueued_dirty += delta;
> > > +
> >
> > This sounds good. BTW, when shrink_folio_list() in evict_folios()
> > returns, we should add stat.nr_unqueued_dirty to sc->nr.unqueued_dirty
> > there as well.
>
> Thank you for your valuable feedback, I will implement it in the next version.
>
> > >         /* waiting for writeback */
> > > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > +       if (folio_test_locked(folio) || writeback ||
> > > +           (type == LRU_GEN_FILE && dirty)) {
> > >                 gen = folio_inc_gen(lruvec, folio, true);
> > >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > >                 return true;
> > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > >                                     scanned, skipped, isolated,
> > >                                     type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > > -
> > > +       sc->nr.taken += scanned;
> >
> > I think we should only include file pages (in sort_folio) and isolated
> > pages into sc->nr.taken, instead of all scanned pages. For example, if
>
> Unqueued dirty pages are not isolated, but promoted to the newer generation
> in the sort_folio function. So I tend to wake up the flusher thread when the
> number of scanned pages is equal to the number of unqueued dirty pages.
>
> > there are only unevictable and unqueued dirty pages, we would still
> > like to wake up the flusher threads, but because nr.taken counts
> > unevictable pages as well, the wakeup condition in
> > try_to_shrink_lruvec() won't be met.
>
> The situation you mentioned will not happen because the number of
> scanned pages does not include unevictable pages.
> However, there is another situation that needs attention. When the
> scanned pages contain anonymous pages and unqueued dirty pages,
> the flusher cannot be woken up. I will fix this situation in the next version.
>

Pages in unevictable LRUs are not scanned, but unevictable pages that are
not yet moved to unevictable LRUs can be included in scanned pages. See
the !folio_evictable(folio) check in sort_folio().

> > >
> > >         /*
> > >          * There might not be eligible folios due to reclaim_idx. Check the
> > >          * remaining to prevent livelock if it's not making progress.
> > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >                 cond_resched();
> > >         }
> > >
> > > +       /*
> > > +        * If too many file cache in the coldest generation can't be evicted
> > > +        * due to being dirty, wake up the flusher.
> > > +        */
> > > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +
> >
> > try_to_shrink_lruvec() can be called from shrink_node() for global
> > reclaim as well.
> > We need to reset sc->nr before calling
> > lru_gen_shrink_node() there. MGLRU didn't need that because it didn't
> > use sc->nr until this change.
>
> Thank you for your valuable feedback, I will implement it in the next version.
>
> > >         /* whether this lruvec should be rotated */
> > >         return nr_to_scan < 0;
> > > }
> > > --
> > > 2.43.5
> > >