From: jingxiang zeng <jingxiangzeng.cas@gmail.com>
To: Chris Li <chrisl@kernel.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
kasong@tencent.com, linuszeng@tencent.com,
linux-kernel@vger.kernel.org, tjmercier@google.com,
weixugc@google.com, yuzhao@google.com
Subject: Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
Date: Wed, 9 Oct 2024 13:43:16 +0800
Message-ID: <CAJqJ8ihHwD_es2t8G=7aJZCMhnKH5n6320TaHgjFSaYuhWdc_Q@mail.gmail.com>
In-Reply-To: <CAJqJ8ijoU4qXA87SLDH1jZq8yVGZZtJhVffmbjCCsaHNv1Zfuw@mail.gmail.com>
On Wed, 9 Oct 2024 at 13:29, jingxiang zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Jingxiang,
> >
> > I ran the same swap stress test on V4, and it is much better than V3.
> > The V3 test hung (timed out). V4 no longer hangs; it
> > finishes in about the same time.
> >
> > Looking closer at V4, it seems that the V4 system time is
> > slightly worse.
> > Is that expected, or might it be noise in my test? I am just trying
> > to understand it better; it is not a NACK by any means.
> >
> > Here are the numbers on mm-unstable
> > c121617e3606be6575cdacfdb63cc8d67b46a568:
> > Without (10 times):
> > user 2688.328
> > system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> > 6057.21 6063.31 6075.76 6123.18
> > real 277.145
> >
> > With V4:
> > First run (10 times):
> > user 2688.537
> > system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> > 6197.26 6202.98 6204.64 6213.74
> > real 280.174
> > Second run (10 times):
> > user 2771.498
> > system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> > 6204.03 6212.9 6216.32 6256.84
> > real 284.854
> >
> > Chris
> >
>
> Hi Chris,
>
> Before I released the V4 version, I also ran the swap stress test you
> gave me, with -j32, 1G memcg on my local branch:
>
> Without the patch:
> 1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
> 920100maxresident)k
>
> With the patch:
> 1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
> 919880maxresident)k
The two results above are labeled in reverse; please refer to my other email:
https://lore.kernel.org/all/CAJqJ8ijzAhVxuE16-oawhhs3YmHKWmmoo0ca5KyaLGME7aoXjw@mail.gmail.com/
>
> My test results are the same as yours. This should not be test noise. I am
> trying to analyze whether it can be further optimized.
>
> Jingxiang Zeng
>
> > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> > <jingxiangzeng.cas@gmail.com> wrote:
> > >
> > > From: Jingxiang Zeng <linuszeng@tencent.com>
> > >
> > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > removed the opportunity to wake up flushers during MGLRU page
> > > reclamation. This increases the likelihood of triggering OOM
> > > when reclaim encounters many dirty pages on MGLRU.
> > >
> > > This leads to premature OOM if there are too many dirty pages in
> > > the cgroup:
> > > Killed
> > >
> > > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > > order=0, oom_score_adj=0
> > >
> > > Call Trace:
> > > <TASK>
> > > dump_stack_lvl+0x5f/0x80
> > > dump_stack+0x14/0x20
> > > dump_header+0x46/0x1b0
> > > oom_kill_process+0x104/0x220
> > > out_of_memory+0x112/0x5a0
> > > mem_cgroup_out_of_memory+0x13b/0x150
> > > try_charge_memcg+0x44f/0x5c0
> > > charge_memcg+0x34/0x50
> > > __mem_cgroup_charge+0x31/0x90
> > > filemap_add_folio+0x4b/0xf0
> > > __filemap_get_folio+0x1a4/0x5b0
> > > ? srso_return_thunk+0x5/0x5f
> > > ? __block_commit_write+0x82/0xb0
> > > ext4_da_write_begin+0xe5/0x270
> > > generic_perform_write+0x134/0x2b0
> > > ext4_buffered_write_iter+0x57/0xd0
> > > ext4_file_write_iter+0x76/0x7d0
> > > ? selinux_file_permission+0x119/0x150
> > > ? srso_return_thunk+0x5/0x5f
> > > ? srso_return_thunk+0x5/0x5f
> > > vfs_write+0x30c/0x440
> > > ksys_write+0x65/0xe0
> > > __x64_sys_write+0x1e/0x30
> > > x64_sys_call+0x11c2/0x1d50
> > > do_syscall_64+0x47/0x110
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > >
> > > ...
> > > file_dirty 303247360
> > > file_writeback 0
> > > ...
> > >
> > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > > oom_score_adj:0
> > >
> > > The flusher wake-up was removed to reduce SSD wear, but if all the
> > > folios at the tail of an LRU are dirty, not waking up the flusher
> > > can easily lead to thrashing. So wake it up when a mem cgroup is about
> > > to OOM due to dirty caches.
> > >
> > > ---
> > > Changes from v3:
> > > - Avoid taking lock and reduce overhead on folio isolation by
> > > checking the right flags and rework wake up condition, fixing the
> > > performance regression reported by Chris Li.
> > > [Chris Li, Kairui Song]
> > > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > > case as well, and update comments. [Kairui Song]
> > > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v2:
> > > - Acquire the lock before calling the folio_check_dirty_writeback
> > > function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v1:
> > > - Add code to count the number of unqueued_dirty in the sort_folio
> > > function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > > ---
> > >
> > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > Cc: T.J. Mercier <tjmercier@google.com>
> > > Cc: Wei Xu <weixugc@google.com>
> > > Cc: Yu Zhao <yuzhao@google.com>
> > > ---
> > > mm/vmscan.c | 19 ++++++++++++++++---
> > > 1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index dc7a285b256b..2a5c2fe81467 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                        int tier_idx)
> > >  {
> > >         bool success;
> > > +       bool dirty, writeback;
> > >         int gen = folio_lru_gen(folio);
> > >         int type = folio_is_file_lru(folio);
> > >         int zone = folio_zonenum(folio);
> > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >                 return true;
> > >         }
> > >
> > > +       dirty = folio_test_dirty(folio);
> > > +       writeback = folio_test_writeback(folio);
> > > +       if (type == LRU_GEN_FILE && dirty && !writeback)
> > > +               sc->nr.unqueued_dirty += delta;
> > > +
> > >         /* waiting for writeback */
> > > -       if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > -           (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > +       if (folio_test_locked(folio) || writeback ||
> > > +           (type == LRU_GEN_FILE && dirty)) {
> > >                 gen = folio_inc_gen(lruvec, folio, true);
> > >                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > >                 return true;
> > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > >         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > >                                 scanned, skipped, isolated,
> > >                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > > -
> > > +       sc->nr.taken += scanned;
> > >         /*
> > >          * There might not be eligible folios due to reclaim_idx. Check the
> > >          * remaining to prevent livelock if it's not making progress.
> > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >                 cond_resched();
> > >         }
> > >
> > > +       /*
> > > +        * If too many file cache in the coldest generation can't be evicted
> > > +        * due to being dirty, wake up the flusher.
> > > +        */
> > > +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +
> > >         /* whether this lruvec should be rotated */
> > >         return nr_to_scan < 0;
> > >  }
> > > --
> > > 2.43.5
> > >