From: Wei Xu
Date: Mon, 28 Oct 2024 14:10:43 -0700
Subject: Re: [PATCH v6] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
To: Jingxiang Zeng
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com,
 linux-kernel@vger.kernel.org, tjmercier@google.com, yuzhao@google.com,
 chrisl@kernel.org
In-Reply-To: <20241026115714.1437435-1-jingxiangzeng.cas@gmail.com>
References: <20241026115714.1437435-1-jingxiangzeng.cas@gmail.com>

On Sat, Oct 26, 2024 at 4:57 AM Jingxiang Zeng wrote:
>
> From: Zeng Jingxiang
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation on
> MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
>
>  dump_stack_lvl+0x5f/0x80
>  dump_stack+0x14/0x20
>  dump_header+0x46/0x1b0
>  oom_kill_process+0x104/0x220
>  out_of_memory+0x112/0x5a0
>  mem_cgroup_out_of_memory+0x13b/0x150
>  try_charge_memcg+0x44f/0x5c0
>  charge_memcg+0x34/0x50
>  __mem_cgroup_charge+0x31/0x90
>  filemap_add_folio+0x4b/0xf0
>  __filemap_get_folio+0x1a4/0x5b0
>  ? srso_return_thunk+0x5/0x5f
>  ? __block_commit_write+0x82/0xb0
>  ext4_da_write_begin+0xe5/0x270
>  generic_perform_write+0x134/0x2b0
>  ext4_buffered_write_iter+0x57/0xd0
>  ext4_file_write_iter+0x76/0x7d0
>  ? selinux_file_permission+0x119/0x150
>  ? srso_return_thunk+0x5/0x5f
>  ? srso_return_thunk+0x5/0x5f
>  vfs_write+0x30c/0x440
>  ksys_write+0x65/0xe0
>  __x64_sys_write+0x1e/0x30
>  x64_sys_call+0x11c2/0x1d50
>  do_syscall_64+0x47/0x110
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wear, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could easily lead to thrashing. So wake it up when a mem cgroup is about
> to OOM due to dirty caches.
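
For reference, the counters in the OOM report above can be turned into a
dirty fraction. Below is a small user-space sketch, not kernel code:
dirty_percent_of_limit() is a made-up helper name, and it assumes that
file_dirty is reported in bytes and the memcg limit in kB, as in the dump
quoted above.

```c
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not a kernel API):
 * percentage of a memcg limit, given in kB, that is occupied by dirty
 * file pages, given in bytes. The result is rounded down.
 */
unsigned int dirty_percent_of_limit(uint64_t file_dirty_bytes,
                                    uint64_t limit_kb)
{
        uint64_t dirty_kb = file_dirty_bytes / 1024;

        /* Avoid dividing by zero for an unset limit. */
        if (limit_kb == 0)
                return 0;

        return (unsigned int)(dirty_kb * 100 / limit_kb);
}
```

With file_dirty 303247360 and limit 308224kB this gives 96, so nearly the
whole cgroup is dirty page cache while file_writeback is still 0.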
>
> I ran the kernel build test[1] on v6, with -j16 in a 1G memcg on my local
> branch:
>
> Without the patch (10 runs):
> user 1449.394
> system 368.78 372.58 363.03 362.31 360.84 372.70 368.72 364.94 373.51
> 366.58 (avg 367.399)
> real 164.883
>
> With the v6 patch (10 runs):
> user 1447.525
> system 360.87 360.63 372.39 364.09 368.49 365.15 359.93 362.04 359.72
> 354.60 (avg 362.79)
> real 164.514
>
> The results show about a 1% improvement in system time with this patch,
> which is most likely noise.
>
> ---
> Changes from v5:
> - Make wake-up conditions stricter. [Wei Xu, Jingxiang Zeng]
> - Use sc->nr.file_taken to count. [Wei Xu, Jingxiang Zeng]
> - Link to v5: https://lore.kernel.org/all/20241018052942.3810740-1-jingxiangzeng.cas@gmail.com/
> Changes from v4:
> - Add the number of unqueued dirty pages in the shrink_folio_list function
>   to sc->nr.unqueued_dirty. [Wei Xu, Jingxiang Zeng]
> - Reset sc->nr before calling lru_gen_shrink_node function.
>   [Wei Xu, Jingxiang Zeng]
> - Modified the conditions for waking up the flusher thread to avoid
>   interference from unevictable and anonymous pages.
>   [Wei Xu, Jingxiang Zeng]
> - Link to v4: https://lore.kernel.org/all/20240929113050.76079-1-jingxiangzeng.cas@gmail.com/
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
>   checking the right flags and rework wake up condition, fixing the
>   performance regression reported by Chris Li.
>   [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
>   case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
>   function.
>   [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
>   function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Link: https://lore.kernel.org/all/CACePvbV4L-gRN9UKKuUnksfVJjOTq_5Sti2-e=pb_w51kucLKQ@mail.gmail.com/ [1]
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Suggested-by: Wei Xu
> Signed-off-by: Zeng Jingxiang
> Signed-off-by: Kairui Song
> Cc: T.J. Mercier
> Cc: Wei Xu
> Cc: Yu Zhao
> ---
>  mm/vmscan.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 084de0efe59b..794730c8c1de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4282,6 +4282,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                        int tier_idx)
>  {
>         bool success;
> +       bool dirty, writeback;
>         int gen = folio_lru_gen(folio);
>         int type = folio_is_file_lru(folio);
>         int zone = folio_zonenum(folio);
> @@ -4327,9 +4328,17 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                 return true;
>         }
>
> +       dirty = folio_test_dirty(folio);
> +       writeback = folio_test_writeback(folio);
> +       if (type == LRU_GEN_FILE && dirty) {
> +               sc->nr.file_taken += delta;
> +               if (!writeback)
> +                       sc->nr.unqueued_dirty += delta;
> +       }
> +
>         /* waiting for writeback */
> -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +       if (folio_test_locked(folio) || writeback ||
> +           (type == LRU_GEN_FILE && dirty)) {
>                 gen = folio_inc_gen(lruvec, folio, true);
>                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
>                 return true;
> @@ -4445,7 +4454,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
>                                 scanned, skipped, isolated,
>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> +       if (type == LRU_GEN_FILE)
> +               sc->nr.file_taken += isolated;
>         /*
>          * There might not be eligible folios due to reclaim_idx. Check the
>          * remaining to prevent livelock if it's not making progress.
> @@ -4579,6 +4589,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>                 return scanned;
>  retry:
>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
> +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>         sc->nr_reclaimed += reclaimed;
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         scanned, reclaimed, &stat, sc->priority,
> @@ -4787,6 +4798,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 cond_resched();
>         }
>
> +       /*
> +        * If too many file cache in the coldest generation can't be evicted
> +        * due to being dirty, wake up the flusher.
> +        */
> +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
>         /* whether this lruvec should be rotated */
>         return nr_to_scan < 0;
>  }
> @@ -5932,6 +5950,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         bool reclaimable = false;
>
>         if (lru_gen_enabled() && root_reclaim(sc)) {
> +               memset(&sc->nr, 0, sizeof(sc->nr));
>                 lru_gen_shrink_node(pgdat, sc);
>                 return;
>         }
> --
> 2.43.5
>

Reviewed-by: Wei Xu
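
To spell out the wake-up condition added to try_to_shrink_lruvec(), here
is a minimal user-space sketch. It is not kernel code: struct
reclaim_counts and should_wake_flusher() are made-up names standing in for
the two sc->nr counters and the inline check in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two sc->nr counters the patch relies on. */
struct reclaim_counts {
        unsigned long unqueued_dirty;   /* dirty file folios not yet under writeback */
        unsigned long file_taken;       /* file folios counted during this pass */
};

/*
 * Mirrors the check in the patch: wake the flusher only when at least
 * one unqueued dirty folio was seen and every counted file folio was
 * such a folio.
 */
bool should_wake_flusher(const struct reclaim_counts *nr)
{
        return nr->unqueued_dirty && nr->unqueued_dirty == nr->file_taken;
}
```

Under this sketch a pass that counted 32 file folios, all dirty and not
queued for writeback, returns true, while 31 out of 32 returns false. This
reflects the stricter wake-up condition noted in the v5 to v6 changelog.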