From: Wei Xu <weixugc@google.com>
To: Jingxiang Zeng <jingxiangzeng.cas@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
kasong@tencent.com, linuszeng@tencent.com,
linux-kernel@vger.kernel.org, tjmercier@google.com,
yuzhao@google.com, chrisl@kernel.org
Subject: Re: [PATCH v5] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
Date: Thu, 17 Oct 2024 23:36:13 -0700
Message-ID: <CAAPL-u_e=HxAXQPpvTzXSdv_3kLR-95S1CnFB9vk09nnOBYa_Q@mail.gmail.com>
In-Reply-To: <20241018052942.3810740-1-jingxiangzeng.cas@gmail.com>

On Thu, Oct 17, 2024 at 10:29 PM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can increase the likelihood of triggering OOM
> when many dirty pages are encountered during reclamation on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is about
> to OOM due to dirty caches.
>
> I did run the build kernel test[1] on V5, with -j16 1G memcg on my local
> branch:
>
> Without the patch(10 times):
> user 1473.29
> system 347.87 339.17 345.28 354.64 352.46 355.63 358.80 359.40 358.28
> 350.95 (avg 352.248)
> real 166.651
>
> With the V5 patch(10 times):
> user 1470.7
> system 339.13 350.58 350.07 355.58 348.96 344.83 351.78 336.39 350.45
> 343.31 (avg 347.108)
> real 165.821
>
> Test results show that this patch has about 1% performance improvement,
> which is most likely noise.
>
> ---
> Changes from v4:
> - Add the number of unqueued dirty pages in the shrink_folio_list function
> to sc->nr.unqueued_dirty. [Wei Xu, Jingxiang Zeng]
> - Reset sc->nr before calling lru_gen_shrink_node function.
> [Wei Xu, Jingxiang Zeng]
> - Modified the conditions for waking up the flusher thread to avoid
> interference from unevictable and anonymous pages.
> [Wei Xu, Jingxiang Zeng]
> - Link to v4: https://lore.kernel.org/all/20240929113050.76079-1-jingxiangzeng.cas@gmail.com/
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
> checking the right flags and rework wake up condition, fixing the
> performance regression reported by Chris Li.
> [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Link: https://lore.kernel.org/all/CACePvbV4L-gRN9UKKuUnksfVJjOTq_5Sti2-e=pb_w51kucLKQ@mail.gmail.com/ [1]
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> mm/vmscan.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d0486189804..97e0af338ee0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4292,6 +4292,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4337,9 +4338,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> return true;
> }
>
> + dirty = folio_test_dirty(folio);
> + writeback = folio_test_writeback(folio);
> + if (type == LRU_GEN_FILE && dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> +
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_locked(folio) || writeback ||
> + (type == LRU_GEN_FILE && dirty)) {
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4455,7 +4461,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> + if (type == LRU_GEN_FILE)
> + sc->nr.taken += isolated;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4589,6 +4596,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> return scanned;
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> sc->nr_reclaimed += reclaimed;
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> scanned, reclaimed, &stat, sc->priority,
> @@ -4797,6 +4805,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> cond_resched();
> }
>
> + /*
> + * If too many file cache in the coldest generation can't be evicted
> + * due to being dirty, wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty && !sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);

This wakeup condition is too relaxed: we can wake up the flusher even
when there is only one unqueued dirty page and many clean file pages.

I'd suggest that we also add dirty file pages to sc->nr.taken in
sort_folio(), in addition to the isolated file pages counted in
scan_folios(). Then we can change the wakeup condition to:

    if (sc->nr.taken && sc->nr.unqueued_dirty == sc->nr.taken)

Also, given that only file pages are counted here, it would be better
to add sc->nr.file_taken as a new counter and use it in this patch
for clarity.

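As a rough illustration of the suggested counting, here is a minimal
user-space sketch. It is not kernel code: the struct and the helper names
(file_scan_counts, count_sorted_dirty, count_isolated, should_wake_flusher)
are hypothetical and only model how the two counters would be updated in
sort_folio() and scan_folios() before the check in try_to_shrink_lruvec().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the counters discussed above. In the
 * kernel they would live in sc->nr; the names here are illustrative only.
 */
struct file_scan_counts {
	unsigned long file_taken;     /* file pages looked at during this scan */
	unsigned long unqueued_dirty; /* of those, dirty and not under writeback */
};

/* Models sort_folio(): a dirty file folio is skipped, but still counted. */
static void count_sorted_dirty(struct file_scan_counts *c, unsigned long nr_pages)
{
	c->file_taken += nr_pages;
	c->unqueued_dirty += nr_pages;
}

/* Models scan_folios(): isolated file folios are counted as taken. */
static void count_isolated(struct file_scan_counts *c, unsigned long nr_pages)
{
	c->file_taken += nr_pages;
}

/*
 * Suggested condition: wake the flusher only when at least one file page
 * was seen and every file page seen was unqueued dirty.
 */
static bool should_wake_flusher(const struct file_scan_counts *c)
{
	assert(c->unqueued_dirty <= c->file_taken);
	return c->file_taken && c->unqueued_dirty == c->file_taken;
}
```

With these helpers, one skipped dirty page alone satisfies the condition,
while one dirty page plus 99 isolated clean file pages does not, and an
empty scan never triggers a wakeup.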
> +
> /* whether this lruvec should be rotated */
> return nr_to_scan < 0;
> }
> @@ -5942,6 +5957,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> bool reclaimable = false;
>
> if (lru_gen_enabled() && root_reclaim(sc)) {
> + memset(&sc->nr, 0, sizeof(sc->nr));
> lru_gen_shrink_node(pgdat, sc);
> return;
> }
> --
> 2.43.5
>