* [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
@ 2024-09-13 8:44 ` Jingxiang Zeng
2024-09-16 7:41 ` Andrew Morton
2024-10-11 20:01 ` Nhat Pham
2024-09-13 8:45 ` [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM Jingxiang Zeng
` (5 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Jingxiang Zeng @ 2024-09-13 8:44 UTC (permalink / raw)
To: linux-mm
Cc: akpm, cgroups, hannes, linuszeng, linux-kernel, mhocko,
muchun.song, roman.gushchin, shakeel.butt, yosryahmed
From: Jingxiang Zeng <linuszeng@tencent.com>
In proactive memory reclamation scenarios, it is necessary to estimate the
pswpin and pswpout metrics of the cgroup to determine whether to continue
reclaiming anonymous pages in the current batch. This patch will collect
these metrics and expose them.
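As a rough illustration of the consumer described above (hypothetical:
the cgroup path, batch size, and read_stat() helper below are made up
for this sketch, not part of the patch), a user-space controller could
compare pswpin before and after each memory.reclaim request and stop
reclaiming anonymous pages once the cgroup starts swapping them back in:

#include <stdio.h>
#include <string.h>

/* Read one counter from the cgroup's memory.stat file. */
static unsigned long long read_stat(const char *cg, const char *key)
{
	char path[256], name[64];
	unsigned long long val, out = 0;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.stat", cg);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &val) == 2)
		if (!strcmp(name, key))
			out = val;
	fclose(f);
	return out;
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/test";	/* hypothetical cgroup */
	unsigned long long before = read_stat(cg, "pswpin");
	char path[256];
	FILE *f;

	/* Ask for one batch of proactive reclaim. */
	snprintf(path, sizeof(path), "%s/memory.reclaim", cg);
	f = fopen(path, "w");
	if (f) {
		fprintf(f, "16M");
		fclose(f);
	}
	if (read_stat(cg, "pswpin") > before)
		printf("refaulting from swap; stop reclaiming anon\n");
	return 0;
}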
Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/memcontrol.c | 2 ++
mm/page_io.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6efbfc9399d0..dbc1d43a5c4c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -418,6 +418,8 @@ static const unsigned int memcg_vm_event_stat[] = {
PGPGIN,
PGPGOUT,
#endif
+ PSWPIN,
+ PSWPOUT,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_KHUGEPAGED,
diff --git a/mm/page_io.c b/mm/page_io.c
index b6f1519d63b0..4bc77d1c6bfa 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -310,6 +310,7 @@ static inline void count_swpout_vm_event(struct folio *folio)
}
count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
#endif
+ count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio));
count_vm_events(PSWPOUT, folio_nr_pages(folio));
}
@@ -505,6 +506,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
for (p = 0; p < sio->pages; p++) {
struct folio *folio = page_folio(sio->bvec[p].bv_page);
+ count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
folio_mark_uptodate(folio);
folio_unlock(folio);
}
@@ -588,6 +590,7 @@ static void swap_read_folio_bdev_sync(struct folio *folio,
* attempt to access it in the page fault retry time check.
*/
get_task_struct(current);
+ count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
count_vm_event(PSWPIN);
submit_bio_wait(&bio);
__end_swap_bio_read(&bio);
@@ -603,6 +606,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,
bio->bi_iter.bi_sector = swap_folio_sector(folio);
bio->bi_end_io = end_swap_bio_read;
bio_add_folio_nofail(bio, folio, folio_size(folio), 0);
+ count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
count_vm_event(PSWPIN);
submit_bio(bio);
}
--
2.43.5
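For reference, the helper these hunks lean on resolves the folio's memcg
and charges the event there, which is why the swap I/O lands in the right
cgroup's memory.stat. Roughly (paraphrased from that era's
include/linux/memcontrol.h, not quoted verbatim):

static inline void count_memcg_folio_events(struct folio *folio,
		enum vm_event_item idx, unsigned long nr)
{
	struct mem_cgroup *memcg = folio_memcg(folio);

	if (memcg)
		count_memcg_events(memcg, idx, nr);
}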
* [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
2024-09-13 8:44 ` [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter Jingxiang Zeng
@ 2024-09-13 8:45 ` Jingxiang Zeng
2024-09-16 16:11 ` Wei Xu
2024-09-25 0:41 ` Chris Li
2024-09-24 12:13 ` [PATCH V3] " Jingxiang Zeng
` (4 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Jingxiang Zeng @ 2024-09-13 8:45 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc, yuzhao
From: Zeng Jingxiang <linuszeng@tencent.com>
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
removed the opportunity to wake up flushers during the MGLRU page
reclamation process, which can lead to an increased likelihood of
triggering OOM when encountering many dirty pages during reclamation on MGLRU.
This leads to premature OOM if there are too many dirty pages in cgroup:
Killed
dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
order=0, oom_score_adj=0
Call Trace:
<TASK>
dump_stack_lvl+0x5f/0x80
dump_stack+0x14/0x20
dump_header+0x46/0x1b0
oom_kill_process+0x104/0x220
out_of_memory+0x112/0x5a0
mem_cgroup_out_of_memory+0x13b/0x150
try_charge_memcg+0x44f/0x5c0
charge_memcg+0x34/0x50
__mem_cgroup_charge+0x31/0x90
filemap_add_folio+0x4b/0xf0
__filemap_get_folio+0x1a4/0x5b0
? srso_return_thunk+0x5/0x5f
? __block_commit_write+0x82/0xb0
ext4_da_write_begin+0xe5/0x270
generic_perform_write+0x134/0x2b0
ext4_buffered_write_iter+0x57/0xd0
ext4_file_write_iter+0x76/0x7d0
? selinux_file_permission+0x119/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
vfs_write+0x30c/0x440
ksys_write+0x65/0xe0
__x64_sys_write+0x1e/0x30
x64_sys_call+0x11c2/0x1d50
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
memory: usage 308224kB, limit 308224kB, failcnt 2589
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
...
file_dirty 303247360
file_writeback 0
...
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
oom_score_adj:0
The flusher wake up was removed to decrease SSD wearing, but if we are
seeing all dirty folios at the tail of an LRU, not waking up the flusher
could lead to thrashing easily. So wake it up when a mem cgroup is
about to OOM due to dirty caches.
Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/vmscan.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..ce471d686a88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4284,6 +4284,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
+ bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4332,6 +4333,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* waiting for writeback */
if (folio_test_locked(folio) || folio_test_writeback(folio) ||
(type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ folio_check_dirty_writeback(folio, &dirty, &writeback);
+ if (dirty && !writeback)
+ sc->nr.unqueued_dirty += delta;
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
@@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+ sc->nr.taken += isolated;
/*
* There might not be eligible folios due to reclaim_idx. Check the
* remaining to prevent livelock if it's not making progress.
@@ -4920,6 +4925,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
if (try_to_shrink_lruvec(lruvec, sc))
lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
+ /*
+ * If too many pages failed to evict because they are dirty,
+ * memory pressure has pushed dirty pages to the oldest gen;
+ * wake up the flusher.
+ */
+ if (sc->nr.unqueued_dirty > sc->nr.taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
clear_mm_walk();
blk_finish_plug(&plug);
--
2.43.5
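To make the heuristic concrete, here is a minimal user-space model of the
accounting above (a sketch under simplifying assumptions, not kernel code;
struct folio_model and sort_one() are stand-ins for the real folio and
sort_folio()):

#include <stdbool.h>
#include <stdio.h>

struct folio_model { bool file, dirty, writeback, locked; int nr_pages; };
struct nr_stats { unsigned long unqueued_dirty, taken; };

/* Mirrors the sort_folio() hunk: a dirty-but-not-under-writeback file
 * folio is kept on the LRU (aged one generation) instead of isolated,
 * and its pages are recorded as unqueued_dirty. */
static bool sort_one(struct folio_model *f, struct nr_stats *nr)
{
	if (f->locked || f->writeback || (f->file && f->dirty)) {
		if (f->dirty && !f->writeback)
			nr->unqueued_dirty += f->nr_pages;
		return true;	/* promoted, not reclaimed */
	}
	nr->taken += f->nr_pages;	/* isolated for eviction */
	return false;
}

int main(void)
{
	struct nr_stats nr = { 0, 0 };
	struct folio_model tail[] = {
		{ .file = true, .dirty = true,  .nr_pages = 1 },
		{ .file = true, .dirty = true,  .nr_pages = 1 },
		{ .file = true, .dirty = false, .nr_pages = 1 },
	};

	for (unsigned int i = 0; i < sizeof(tail) / sizeof(tail[0]); i++)
		sort_one(&tail[i], &nr);

	/* The lru_gen_shrink_lruvec() check: more pages were dodged for
	 * being dirty than were actually taken, so kick the flusher. */
	if (nr.unqueued_dirty > nr.taken)
		printf("would call wakeup_flusher_threads(WB_REASON_VMSCAN)\n");
	return 0;
}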
* Re: [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter
2024-09-13 8:44 ` [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter Jingxiang Zeng
@ 2024-09-16 7:41 ` Andrew Morton
2024-09-21 7:51 ` jingxiang zeng
2024-10-11 20:01 ` Nhat Pham
1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2024-09-16 7:41 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, cgroups, hannes, linuszeng, linux-kernel, mhocko,
muchun.song, roman.gushchin, shakeel.butt, yosryahmed
On Fri, 13 Sep 2024 16:44:53 +0800 Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
> From: Jingxiang Zeng <linuszeng@tencent.com>
>
> In proactive memory reclamation scenarios, it is necessary to estimate the
> pswpin and pswpout metrics of the cgroup to determine whether to continue
> reclaiming anonymous pages in the current batch. This patch will collect
> these metrics and expose them.
>
Please explain the differences between v1 and v2:
--- a/mm/memcontrol-v1.c~mm-memcontrol-add-per-memcg-pgpgin-pswpin-counter-v2
+++ a/mm/memcontrol-v1.c
@@ -2729,8 +2729,6 @@ static const char *const memcg1_stat_nam
static const unsigned int memcg1_events[] = {
PGPGIN,
PGPGOUT,
- PSWPIN,
- PSWPOUT,
PGFAULT,
PGMAJFAULT,
};
_
* Re: [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-09-13 8:45 ` [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM Jingxiang Zeng
@ 2024-09-16 16:11 ` Wei Xu
2024-09-23 1:44 ` jingxiang zeng
2024-09-25 0:41 ` Chris Li
1 sibling, 1 reply; 24+ messages in thread
From: Wei Xu @ 2024-09-16 16:11 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier, yuzhao
On Fri, Sep 13, 2024 at 1:45 AM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is
> about to OOM due to dirty caches.
>
> Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/vmscan.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..ce471d686a88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4284,6 +4284,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4332,6 +4333,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> /* waiting for writeback */
> if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + folio_check_dirty_writeback(folio, &dirty, &writeback);
We cannot simply call folio_check_dirty_writeback() here because
folio_check_dirty_writeback() assumes that the folio is locked (e.g.
see buffer_check_dirty_writeback()).
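For context, the assertion in question looks roughly like this
(paraphrased from fs/buffer.c of that era, not quoted verbatim; the folio
lock is what keeps the buffer_head list stable while it is walked):

void buffer_check_dirty_writeback(struct folio *folio,
				  bool *dirty, bool *writeback)
{
	struct buffer_head *head, *bh;

	*dirty = false;
	*writeback = false;

	BUG_ON(!folio_test_locked(folio));
	/* ... walks the folio's buffers to set *dirty and *writeback ... */
}

The sort_folio() branch above is entered for folios that are locked *or*
dirty *or* under writeback, so the folio is not necessarily locked at the
call site. This is the "kernel BUG at fs/buffer.c:95!" that Chris Li's
stress test trips later in the thread.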
> + if (dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> + sc->nr.taken += isolated;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4920,6 +4925,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> if (try_to_shrink_lruvec(lruvec, sc))
> lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
>
> + /*
> + * If too many pages failed to evict because they are dirty,
> + * memory pressure has pushed dirty pages to the oldest gen;
> + * wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty > sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
The wakeups will happen much more often than intended because
sc->nr.unqueued_dirty now includes the dirty file pages, but
sc->nr.taken doesn't.
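A worked example of the imbalance: if one pass scans 100 tail folios of
which 60 are unqueued dirty file pages (promoted, so unqueued_dirty
becomes 60) and the other 40 are clean and get isolated (taken becomes
40), then unqueued_dirty > taken already holds and the flusher is woken,
even though reclaim is still making steady progress.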
> clear_mm_walk();
>
> blk_finish_plug(&plug);
> --
> 2.43.5
>
* Re: [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter
2024-09-16 7:41 ` Andrew Morton
@ 2024-09-21 7:51 ` jingxiang zeng
0 siblings, 0 replies; 24+ messages in thread
From: jingxiang zeng @ 2024-09-21 7:51 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, hannes, linuszeng, linux-kernel, mhocko,
muchun.song, roman.gushchin, shakeel.butt, yosryahmed
On Mon, 16 Sept 2024 at 15:41, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 13 Sep 2024 16:44:53 +0800 Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> > From: Jingxiang Zeng <linuszeng@tencent.com>
> >
> > In proactive memory reclamation scenarios, it is necessary to estimate the
> > pswpin and pswpout metrics of the cgroup to determine whether to continue
> > reclaiming anonymous pages in the current batch. This patch will collect
> > these metrics and expose them.
> >
>
> Please explain the differences between v1 and v2:
Currently, the PSWPIN and PSWPOUT fields are only used in proactive memory
reclamation scenarios, but memory.reclaim is only exposed in cgroup v2, so
here we simply delete these fields in cgroup v1.
>
> --- a/mm/memcontrol-v1.c~mm-memcontrol-add-per-memcg-pgpgin-pswpin-counter-v2
> +++ a/mm/memcontrol-v1.c
> @@ -2729,8 +2729,6 @@ static const char *const memcg1_stat_nam
> static const unsigned int memcg1_events[] = {
> PGPGIN,
> PGPGOUT,
> - PSWPIN,
> - PSWPOUT,
> PGFAULT,
> PGMAJFAULT,
> };
> _
>
* Re: [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-09-16 16:11 ` Wei Xu
@ 2024-09-23 1:44 ` jingxiang zeng
0 siblings, 0 replies; 24+ messages in thread
From: jingxiang zeng @ 2024-09-23 1:44 UTC (permalink / raw)
To: Wei Xu; +Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier, yuzhao
On Tue, 17 Sept 2024 at 00:11, Wei Xu <weixugc@google.com> wrote:
>
> On Fri, Sep 13, 2024 at 1:45 AM Jingxiang Zeng
> <jingxiangzeng.cas@gmail.com> wrote:
> >
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x5f/0x80
> > dump_stack+0x14/0x20
> > dump_header+0x46/0x1b0
> > oom_kill_process+0x104/0x220
> > out_of_memory+0x112/0x5a0
> > mem_cgroup_out_of_memory+0x13b/0x150
> > try_charge_memcg+0x44f/0x5c0
> > charge_memcg+0x34/0x50
> > __mem_cgroup_charge+0x31/0x90
> > filemap_add_folio+0x4b/0xf0
> > __filemap_get_folio+0x1a4/0x5b0
> > ? srso_return_thunk+0x5/0x5f
> > ? __block_commit_write+0x82/0xb0
> > ext4_da_write_begin+0xe5/0x270
> > generic_perform_write+0x134/0x2b0
> > ext4_buffered_write_iter+0x57/0xd0
> > ext4_file_write_iter+0x76/0x7d0
> > ? selinux_file_permission+0x119/0x150
> > ? srso_return_thunk+0x5/0x5f
> > ? srso_return_thunk+0x5/0x5f
> > vfs_write+0x30c/0x440
> > ksys_write+0x65/0xe0
> > __x64_sys_write+0x1e/0x30
> > x64_sys_call+0x11c2/0x1d50
> > do_syscall_64+0x47/0x110
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake up was removed to decrease SSD wearing, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > could lead to thrashing easily. So wake it up when a mem cgroup is
> > about to OOM due to dirty caches.
> >
> > Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Cc: T.J. Mercier <tjmercier@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > mm/vmscan.c | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..ce471d686a88 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4284,6 +4284,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > int tier_idx)
> > {
> > bool success;
> > + bool dirty, writeback;
> > int gen = folio_lru_gen(folio);
> > int type = folio_is_file_lru(folio);
> > int zone = folio_zonenum(folio);
> > @@ -4332,6 +4333,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > /* waiting for writeback */
> > if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > + folio_check_dirty_writeback(folio, &dirty, &writeback);
>
> We cannot simply call folio_check_dirty_writeback() here because
> folio_check_dirty_writeback() assumes that the folio is locked (e.g.
> see buffer_check_dirty_writeback()).
Thank you for your modification suggestion.
>
> > + if (dirty && !writeback)
> > + sc->nr.unqueued_dirty += delta;
> > gen = folio_inc_gen(lruvec, folio, true);
> > list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > return true;
> > @@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > scanned, skipped, isolated,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >
> > + sc->nr.taken += isolated;
> > /*
> > * There might not be eligible folios due to reclaim_idx. Check the
> > * remaining to prevent livelock if it's not making progress.
> > @@ -4920,6 +4925,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> > if (try_to_shrink_lruvec(lruvec, sc))
> > lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
> >
> > + /*
> > + * If too many pages failed to evict because they are dirty,
> > + * memory pressure has pushed dirty pages to the oldest gen;
> > + * wake up the flusher.
> > + */
> > + if (sc->nr.unqueued_dirty > sc->nr.taken)
> > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
>
> The wakeups will happen much more often than intended because
> sc->nr.unqueued_dirty now includes the dirty file pages, but
> sc->nr.taken doesn't.
When scan_folios() scans the coldest generation and the number of
pages promoted because they are unqueued dirty exceeds the number of
pages that can be isolated, the logic here will wake up the flusher
thread.
This condition will indeed be triggered when there are too many
pages in the coldest generation that cannot be isolated.
I think this condition for triggering the flusher wake-up is
reasonable. Are there any other good trigger conditions?
>
> > clear_mm_walk();
> >
> > blk_finish_plug(&plug);
> > --
> > 2.43.5
> >
* [PATCH V3] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
2024-09-13 8:44 ` [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter Jingxiang Zeng
2024-09-13 8:45 ` [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM Jingxiang Zeng
@ 2024-09-24 12:13 ` Jingxiang Zeng
2024-09-25 2:07 ` Chris Li
2024-09-29 11:26 ` [PATCH] [PATCH v4] " Jingxiang Zeng
` (3 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Jingxiang Zeng @ 2024-09-24 12:13 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc, yuzhao
From: Zeng Jingxiang <linuszeng@tencent.com>
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
removed the opportunity to wake up flushers during the MGLRU page
reclamation process, which can lead to an increased likelihood of
triggering OOM when encountering many dirty pages during reclamation on MGLRU.
This leads to premature OOM if there are too many dirty pages in cgroup:
Killed
dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
order=0, oom_score_adj=0
Call Trace:
<TASK>
dump_stack_lvl+0x5f/0x80
dump_stack+0x14/0x20
dump_header+0x46/0x1b0
oom_kill_process+0x104/0x220
out_of_memory+0x112/0x5a0
mem_cgroup_out_of_memory+0x13b/0x150
try_charge_memcg+0x44f/0x5c0
charge_memcg+0x34/0x50
__mem_cgroup_charge+0x31/0x90
filemap_add_folio+0x4b/0xf0
__filemap_get_folio+0x1a4/0x5b0
? srso_return_thunk+0x5/0x5f
? __block_commit_write+0x82/0xb0
ext4_da_write_begin+0xe5/0x270
generic_perform_write+0x134/0x2b0
ext4_buffered_write_iter+0x57/0xd0
ext4_file_write_iter+0x76/0x7d0
? selinux_file_permission+0x119/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
vfs_write+0x30c/0x440
ksys_write+0x65/0xe0
__x64_sys_write+0x1e/0x30
x64_sys_call+0x11c2/0x1d50
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
memory: usage 308224kB, limit 308224kB, failcnt 2589
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
...
file_dirty 303247360
file_writeback 0
...
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
oom_score_adj:0
The flusher wake up was removed to decrease SSD wearing, but if we are
seeing all dirty folios at the tail of an LRU, not waking up the flusher
could lead to thrashing easily. So wake it up when a mem cgroup is about
to OOM due to dirty caches.
Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
Link: https://lkml.kernel.org/r/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com
Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
---
Changes from v2:
- Acquire the lock before calling the folio_check_dirty_writeback
function.
- Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
Changes from v1:
- Add code to count the number of unqueued_dirty in the sort_folio
function.
- Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
---
mm/vmscan.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..12c285a96353 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4290,6 +4290,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int delta = folio_nr_pages(folio);
int refs = folio_lru_refs(folio);
int tier = lru_tier_from_refs(refs);
+ bool dirty = folio_test_dirty(folio);
+ bool writeback = folio_test_writeback(folio);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
@@ -4330,8 +4332,10 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* waiting for writeback */
- if (folio_test_locked(folio) || folio_test_writeback(folio) ||
- (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ if (folio_test_locked(folio) || dirty ||
+ (type == LRU_GEN_FILE && writeback)) {
+ if (type == LRU_GEN_FILE && dirty && !writeback)
+ sc->nr.unqueued_dirty += delta;
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
@@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+ sc->nr.taken += isolated;
/*
* There might not be eligible folios due to reclaim_idx. Check the
* remaining to prevent livelock if it's not making progress.
@@ -4920,6 +4925,13 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
if (try_to_shrink_lruvec(lruvec, sc))
lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
+ /*
+ * If too many pages in the coldest generation cannot
+ * be isolated, wake up the flusher.
+ */
+ if (sc->nr.unqueued_dirty > sc->nr.taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
clear_mm_walk();
blk_finish_plug(&plug);
--
2.43.5
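Worth noting for later in the thread: compared with v2, the sort_folio()
hunk above also swaps which flag is checked unconditionally. v2 promoted
folios under writeback plus dirty file folios, while v3 promotes any
dirty folio (anon included) and only file folios under writeback. Dirty
anon folios are the common case under swap load, so promoting them keeps
them from ever being isolated for swap-out, which plausibly explains the
swap stress slowdown Chris Li reports below; the v4 changelog's
"checking the right flags" points at this same spot.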
* Re: [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-09-13 8:45 ` [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM Jingxiang Zeng
2024-09-16 16:11 ` Wei Xu
@ 2024-09-25 0:41 ` Chris Li
1 sibling, 0 replies; 24+ messages in thread
From: Chris Li @ 2024-09-25 0:41 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
weixugc, yuzhao, Kairui Song
Hi Jingxiang,
The mm-unstable branch is breaking my swap stress test again. I was
able to bisect it down to this change:
efa7d80846115635db76b4a55b6de97fdb456549 ("mm/vmscan: wake up flushers
conditionally to avoid cgroup OOM").
$ git bisect log
# bad: [acfabf7e197f7a5bedf4749dac1f39551417b049] mm: optimization on
page allocation when CMA enabled
# good: [0885ef4705607936fc36a38fd74356e1c465b023] mm: vmscan.c: fix
OOM on swap stress test
git bisect start 'mm-unstable' 'mm-unstable-good'
# good: [38075679b5f157eeacd46c900e9cfc684bdbc167] mm/mremap: replace
can_modify_mm with can_modify_vma
git bisect good 38075679b5f157eeacd46c900e9cfc684bdbc167
# good: [6050df6d706f58b2240bfabece9f406170104d57] maple_tree: fix
comment typo on ma_flag of allocation tree
git bisect good 6050df6d706f58b2240bfabece9f406170104d57
# good: [ae3c99e650da4a8f4deb3670c29059de375a88be] mm/gup: detect huge
pfnmap entries in gup-fast
git bisect good ae3c99e650da4a8f4deb3670c29059de375a88be
# good: [ed8d5b0ce1d738e13c60d6b1a901a56d832e5070] Revert "uprobes:
use vm_special_mapping close() functionality"
git bisect good ed8d5b0ce1d738e13c60d6b1a901a56d832e5070
# good: [508bae239d4e41a34f88405720837030ea456db1] mm/hugetlb:
simplify refs in memfd_alloc_folio
git bisect good 508bae239d4e41a34f88405720837030ea456db1
# good: [bae3bf040ce9b0def345855fd13025c5a518d897] mm/vmstat: defer
the refresh_zone_stat_thresholds after all CPUs bringup
git bisect good bae3bf040ce9b0def345855fd13025c5a518d897
# good: [d675c821b65f0c496df1d33150619b7635827e89]
mm-memcontrol-add-per-memcg-pgpgin-pswpin-counter-v2
git bisect good d675c821b65f0c496df1d33150619b7635827e89
# bad: [a90a317691612f7681c58756762e895682068abb] mm/hugetlb_cgroup:
introduce peak and rsvd.peak to v2
git bisect bad a90a317691612f7681c58756762e895682068abb
# bad: [efa7d80846115635db76b4a55b6de97fdb456549] mm/vmscan: wake up
flushers conditionally to avoid cgroup OOM
git bisect bad efa7d80846115635db76b4a55b6de97fdb456549
# first bad commit: [efa7d80846115635db76b4a55b6de97fdb456549]
mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
The kernel dmesg:
[ 43.929971] zswap: loaded using pool lzo/zsmalloc
[ 43.998722] zram0: detected capacity change from 16777216 to 0
[ 44.012198] zram: Removed device: zram0
[ 44.055409] zram: Added device: zram0
[ 44.081232] zram: Added device: zram1
[ 44.169864] zram0: detected capacity change from 0 to 6553600
[ 44.231357] zram1: detected capacity change from 0 to 40960000
[ 44.339063] Adding 3276796k swap on /dev/zram0. Priority:100
extents:1 across:3276796k SS
[ 44.368001] Adding 20479996k swap on /dev/zram1. Priority:0
extents:1 across:20479996k SS
[ 62.722642] ------------[ cut here ]------------
[ 62.727234] kernel BUG at fs/buffer.c:95!
[ 62.727506] Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 62.727850] CPU: 13 UID: 1000 PID: 5617 Comm: cc1 Tainted: G S 6.11.0-rc6+ #51
[ 62.732770] Tainted: [S]=CPU_OUT_OF_SPEC
[ 62.733001] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360
Gen9, BIOS P89 09/21/2023
[ 62.733465] RIP: 0010:buffer_check_dirty_writeback+0x51/0x60
[ 62.738006] Code: 4c 89 c0 80 22 01 48 8b 08 83 e1 04 74 03 c6 02 01 48
8b 08 83 e1 02 74 03 c6 06 01 48 8b 40 08 4c 39 c0 75 e1 c3 cc cc cc cc
<0f> 0b 90 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 62.743232] RSP: 0000:ffff9fa76e6df5f0 EFLAGS: 00010046
[ 62.743564] RAX: 0017ffffc8004030 RBX: 0000000000000001 RCX: 000000000000003c
[ 62.748182] RDX: ffff9fa76e6df69f RSI: ffff9fa76e6df69e RDI: ffffe921c509cf40
[ 62.752778] RBP: 0000000000000000 R08: ffffe921c517f608 R09: 0000000000000002
[ 62.757558] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 62.762190] R13: 0000000000000001 R14: ffffe921c509cf48 R15: ffff8cb18f892040
[ 62.766841] FS: 00007ff078918ac0(0000) GS:ffff8cd07f280000(0000)
knlGS:0000000000000000
[ 62.767399] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 62.772154] CR2: 00007ff076e7ec54 CR3: 0000002087118004 CR4: 00000000001706f0
[ 62.776774] Call Trace:
[ 62.776976] <TASK>
[ 62.781292] ? __die_body.cold+0x19/0x27
[ 62.781546] ? die+0x2e/0x50
[ 62.785950] ? do_trap+0xca/0x110
[ 62.790393] ? do_error_trap+0x6a/0x90
[ 62.790660] ? buffer_check_dirty_writeback+0x51/0x60
[ 62.790959] ? exc_invalid_op+0x50/0x70
[ 62.791168] ? buffer_check_dirty_writeback+0x51/0x60
[ 62.791469] ? asm_exc_invalid_op+0x1a/0x20
[ 62.791691] ? buffer_check_dirty_writeback+0x51/0x60
[ 62.791982] isolate_folios+0xa8a/0x14c0
[ 62.792234] evict_folios+0xbe/0x9a0
[ 62.792454] ? __mod_memcg_lruvec_state+0xd8/0x1d0
[ 62.796999] try_to_shrink_lruvec+0x2a8/0x3e0
[ 62.801713] ? __pfx_lru_activate+0x10/0x10
[ 62.801957] ? folio_batch_move_lru+0xe8/0x130
[ 62.806410] shrink_lruvec+0x122/0x12f0
[ 62.806648] ? list_lru_count_one+0x50/0x90
[ 62.806897] ? super_cache_count+0x75/0xf0
[ 62.807190] shrink_node+0x2d1/0xc80
[ 62.807418] ? post_alloc_hook+0xb0/0x130
[ 62.807675] ? get_page_from_freelist+0x3a6/0x1b80
[ 62.812392] do_try_to_free_pages+0xb3/0x5b0
[ 62.816910] try_to_free_mem_cgroup_pages+0x102/0x240
[ 62.817299] try_charge_memcg+0x1be/0x630
[ 62.817528] mem_cgroup_swapin_charge_folio+0x7a/0x170
[ 62.817827] do_swap_page+0xf8b/0x1720
[ 62.818048] ? set_ptes.isra.0+0x41/0x90
[ 62.818296] ? do_anonymous_page+0xfc/0x8e0
[ 62.819135] ? __pte_offset_map+0x1b/0x180
[ 62.820017] __handle_mm_fault+0x82e/0xfb0
[ 62.820884] handle_mm_fault+0xe2/0x2c0
[ 62.821738] do_user_addr_fault+0x2ca/0x7b0
[ 62.822588] exc_page_fault+0x7e/0x180
[ 62.823427] asm_exc_page_fault+0x26/0x30
[ 62.824252] RIP: 0033:0x1406d76
[ 62.829250] Code: 01 41 8d 4b ff 41 89 cc 45 21 ec 44 89 e0 49 8b 1c c2
48 85 db 0f 84 79 01 00 00 48 83 fb ff 0f 84 09 04 00 00 44 89 5c 24 08
<44> 39 6b 0c 0f 84 f0 00 00 00 45 89 ef 41 c1 e7 04 45 01 ef 41 21
[ 62.835669] RSP: 002b:00007ffffb6a0730 EFLAGS: 00010213
[ 62.836567] RAX: 000000000000153a RBX: 00007ff076e7ec48 RCX: 000000000000ffff
[ 62.841859] RDX: 000000000000000c RSI: 000000001ffd7c35 RDI: 000000001fdf30f0
[ 62.847171] RBP: 000000001fdf30f0 R08: 0000000000000001 R09: 000000001ffd7c35
[ 62.852424] R10: 00007ff076b7f010 R11: 0000000000010000 R12: 000000000000153a
[ 62.857655] R13: 0000000054d9153a R14: 000000000000000c R15: 000000001fdee620
[ 62.862887] </TASK>
[ 62.863653] Modules linked in:
[ 62.868655] ---[ end trace 0000000000000000 ]---
[ 62.880764] pstore: backend (erst) writing error (-28)
[ 62.881675] RIP: 0010:buffer_check_dirty_writeback+0x51/0x60
[ 62.886849] Code: 4c 89 c0 80 22 01 48 8b 08 83 e1 04 74 03 c6 02 01 48
8b 08 83 e1 02 74 03 c6 06 01 48 8b 40 08 4c 39 c0 75 e1 c3 cc cc cc cc
<0f> 0b 90 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 62.893380] RSP: 0000:ffff9fa76e6df5f0 EFLAGS: 00010046
[ 62.894350] RAX: 0017ffffc8004030 RBX: 0000000000000001 RCX: 000000000000003c
[ 62.899583] RDX: ffff9fa76e6df69f RSI: ffff9fa76e6df69e RDI: ffffe921c509cf40
[ 62.904867] RBP: 0000000000000000 R08: ffffe921c517f608 R09: 0000000000000002
[ 62.910095] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 62.915340] R13: 0000000000000001 R14: ffffe921c509cf48 R15: ffff8cb18f892040
[ 62.920592] FS: 00007ff078918ac0(0000) GS:ffff8cd07f280000(0000)
knlGS:0000000000000000
[ 62.921651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 62.926840] CR2: 00007ff076e7ec54 CR3: 0000002087118004 CR4: 00000000001706f0
[ 62.932098] note: cc1[5617] exited with irqs disabled
[ 62.933045] note: cc1[5617] exited with preempt_count 1
[ 62.934010] ------------[ cut here ]------------
[ 62.939102] WARNING: CPU: 13 PID: 5617 at kernel/exit.c:886
do_exit+0x8b1/0xab0
[ 62.944305] Modules linked in:
[ 62.949282] CPU: 13 UID: 1000 PID: 5617 Comm: cc1 Tainted: G S D 6.11.0-rc6+ #51
[ 62.954634] Tainted: [S]=CPU_OUT_OF_SPEC, [D]=DIE
[ 62.959761] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360
Gen9, BIOS P89 09/21/2023
[ 62.960841] RIP: 0010:do_exit+0x8b1/0xab0
[ 62.961690] Code: f6 e8 a3 e2 ff ff e9 9b fd ff ff 4c 89 e6 bf 05 06 00
00 e8 a1 19 01 00 e9 69 f8 ff ff 48 89 df e8 54 ac 13 00 e9 55 f9 ff ff
<0f> 0b e9 c3 f7 ff ff 0f 0b e9 80 f7 ff ff 4c 89 e6 48 89 df e8 76
[ 62.968265] RSP: 0000:ffff9fa76e6dfed8 EFLAGS: 00010286
[ 62.969296] RAX: 0000000400000000 RBX: ffff8cb18a95a8c0 RCX: 0000000000000000
[ 62.974605] RDX: 0000000000000001 RSI: 0000000000002710 RDI: ffff8cb19015b9c0
[ 62.979927] RBP: ffff8cb18e5cba80 R08: 0000000000000000 R09: 0000000000000008
[ 62.985270] R10: ffff8cd106481000 R11: ffff9fa74f1dd840 R12: 000000000000000b
[ 62.990621] R13: ffff8cb19015b9c0 R14: ffff9fa76e6df548 R15: ffffffffa95c2121
[ 62.995941] FS: 00007ff078918ac0(0000) GS:ffff8cd07f280000(0000)
knlGS:0000000000000000
[ 62.997154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 63.002432] CR2: 00007ff076e7ec54 CR3: 0000002087118004 CR4: 00000000001706f0
[ 63.007760] Call Trace:
[ 63.008607] <TASK>
[ 63.013649] ? do_exit+0x8b1/0xab0
[ 63.018759] ? __warn.cold+0x8e/0xe8
[ 63.019709] ? do_exit+0x8b1/0xab0
[ 63.024862] ? report_bug+0xff/0x140
[ 63.025803] ? handle_bug+0x3c/0x80
[ 63.030906] ? exc_invalid_op+0x17/0x70
[ 63.031879] ? asm_exc_invalid_op+0x1a/0x20
[ 63.032808] ? buffer_check_dirty_writeback+0x51/0x60
[ 63.033809] ? do_exit+0x8b1/0xab0
[ 63.038939] ? do_exit+0x6d/0xab0
[ 63.044028] ? buffer_check_dirty_writeback+0x51/0x60
[ 63.045029] make_task_dead+0x90/0x90
[ 63.046003] rewind_stack_and_make_dead+0x16/0x20
[ 63.051229] RIP: 0033:0x1406d76
[ 63.056352] Code: 01 41 8d 4b ff 41 89 cc 45 21 ec 44 89 e0 49 8b 1c c2
48 85 db 0f 84 79 01 00 00 48 83 fb ff 0f 84 09 04 00 00 44 89 5c 24 08
<44> 39 6b 0c 0f 84 f0 00 00 00 45 89 ef 41 c1 e7 04 45 01 ef 41 21
[ 63.063029] RSP: 002b:00007ffffb6a0730 EFLAGS: 00010213
[ 63.064123] RAX: 000000000000153a RBX: 00007ff076e7ec48 RCX: 000000000000ffff
[ 63.069494] RDX: 000000000000000c RSI: 000000001ffd7c35 RDI: 000000001fdf30f0
[ 63.074950] RBP: 000000001fdf30f0 R08: 0000000000000001 R09: 000000001ffd7c35
[ 63.080336] R10: 00007ff076b7f010 R11: 0000000000010000 R12: 000000000000153a
[ 63.085714] R13: 0000000054d9153a R14: 000000000000000c R15: 000000001fdee620
[ 63.091358] </TASK>
[ 63.092278] ---[ end trace 0000000000000000 ]---
[ 63.125554] Oops: general protection fault, probably for non-canonical
address 0x17ffffc0000008: 0000 [#2] PREEMPT SMP PTI
[ 63.165649] CPU: 11 UID: 1000 PID: 5617 Comm: cc1 Tainted: G S D W 6.11.0-rc6+ #51
[ 63.189553] Tainted: [S]=CPU_OUT_OF_SPEC, [D]=DIE, [W]=WARN
[ 63.190579] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360
Gen9, BIOS P89 09/21/2023
[ 63.191724] RIP: 0010:__list_del_entry_valid_or_report+0x4f/0x80
[ 63.197029] Code: ad de 48 39 c2 0f 84 2c 3e 9a 00 48 b8 22 01 00 00 00
00 ad de 48 39 c1 0f 84 05 3e 9a 00 48 8b 31 48 39 fe 0f 85 e5 3d 9a 00
<48> 8b 42 08 48 39 c6 0f 85 c4 3d 9a 00 b8 01 00 00 00 c3 cc cc cc
[ 63.203751] RSP: 0018:ffff9fa76e6dfbb8 EFLAGS: 00010246
[ 63.204810] RAX: dead000000000122 RBX: ffff9fa76e6dfbc8 RCX: ffff9fa76e6dfbc8
[ 63.210158] RDX: 0017ffffc0000000 RSI: ffffe921c5042e40 RDI: ffffe921c5042e40
[ 63.215545] RBP: 0000000000000001 R08: ffff8cb191d064c8 R09: 000000008040003f
[ 63.220934] R10: 000000008040003f R11: 0000000000000001 R12: dead000000000122
[ 63.226294] R13: dead000000000100 R14: ffff9fa76e6df988 R15: ffffe921c5042e40
Some notes on my bisecting process. I needed to revert
684826f8271ad97580b138b9ffd462005e470b99 ("zram: free secondary
algorithms names") during the bisect. That commit can cause my swap
stress test to fail (OOM-killed), but there are no other kernel
panics. With 684826f8271ad97580b138b9ffd462005e470b99 reverted, the
previous commit d675c821b65f0c496df1d33150619b7635827e89 can pass
the swap stress test.
Chris
On Fri, Sep 13, 2024 at 1:45 AM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is
> about to OOM due to dirty caches.
>
> Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/vmscan.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..ce471d686a88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4284,6 +4284,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4332,6 +4333,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> /* waiting for writeback */
> if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + folio_check_dirty_writeback(folio, &dirty, &writeback);
> + if (dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> + sc->nr.taken += isolated;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4920,6 +4925,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> if (try_to_shrink_lruvec(lruvec, sc))
> lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
>
> + /*
> + * If too many pages failed to evict because they are dirty,
> + * memory pressure has pushed dirty pages to the oldest gen;
> + * wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty > sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> clear_mm_walk();
>
> blk_finish_plug(&plug);
> --
> 2.43.5
>
>
* Re: [PATCH V3] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-09-24 12:13 ` [PATCH V3] " Jingxiang Zeng
@ 2024-09-25 2:07 ` Chris Li
0 siblings, 0 replies; 24+ messages in thread
From: Chris Li @ 2024-09-25 2:07 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
weixugc, yuzhao
Hi Jingxiang,
I just tested your v3 patch, applied on top of
d675c821b65f0c496df1d33150619b7635827e89 ("mm-memcontrol-add-per-memcg-pgpgin-pswpin-counter-v2")
with 684826f8271ad97580b138b9ffd462005e470b99 ("zram: free secondary
algorithms names") reverted.
Without your v3 patch, the swap stress test passes in less than 5 mins.
With your v3 patch, it has been running for over 30 minutes and still
can't complete. It does not produce a kernel panic, though; it is just
extremely slow at the linking phase.
Here is what top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
33895 ... 20 0 8872 1780 1780 R 99.3 0.0 33:18.70 as
34115 ... 20 0 10568 4692 2964 R 1.0 0.0 0:00.97 top
V3 also shows a regression on my swap stress test.
Chris
On Tue, Sep 24, 2024 at 5:14 AM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is about
> to OOM due to dirty caches.
>
> Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
> Link: https://lkml.kernel.org/r/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
> function.
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
> function.
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
> mm/vmscan.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..12c285a96353 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4290,6 +4290,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int delta = folio_nr_pages(folio);
> int refs = folio_lru_refs(folio);
> int tier = lru_tier_from_refs(refs);
> + bool dirty = folio_test_dirty(folio);
> + bool writeback = folio_test_writeback(folio);
> struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
> VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
> @@ -4330,8 +4332,10 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> }
>
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_locked(folio) || dirty ||
> + (type == LRU_GEN_FILE && writeback)) {
> + if (type == LRU_GEN_FILE && dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> + sc->nr.taken += isolated;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4920,6 +4925,13 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> if (try_to_shrink_lruvec(lruvec, sc))
> lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
>
> + /*
> + * If too many pages in the coldest generation cannot
> + * be isolated, wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty > sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> clear_mm_walk();
>
> blk_finish_plug(&plug);
> --
> 2.43.5
>
>
* [PATCH] [PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
` (2 preceding siblings ...)
2024-09-24 12:13 ` [PATCH V3] " Jingxiang Zeng
@ 2024-09-29 11:26 ` Jingxiang Zeng
2024-09-29 11:47 ` jingxiang zeng
2024-09-29 11:30 ` Jingxiang Zeng
` (2 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Jingxiang Zeng @ 2024-09-29 11:26 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc, yuzhao
From: Zeng Jingxiang <linuszeng@tencent.com>
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
removed the opportunity to wake up flushers during the MGLRU page
reclamation process, which can lead to an increased likelihood of
triggering OOM when encountering many dirty pages during reclamation on MGLRU.
This leads to premature OOM if there are too many dirty pages in cgroup:
Killed
dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
order=0, oom_score_adj=0
Call Trace:
<TASK>
dump_stack_lvl+0x5f/0x80
dump_stack+0x14/0x20
dump_header+0x46/0x1b0
oom_kill_process+0x104/0x220
out_of_memory+0x112/0x5a0
mem_cgroup_out_of_memory+0x13b/0x150
try_charge_memcg+0x44f/0x5c0
charge_memcg+0x34/0x50
__mem_cgroup_charge+0x31/0x90
filemap_add_folio+0x4b/0xf0
__filemap_get_folio+0x1a4/0x5b0
? srso_return_thunk+0x5/0x5f
? __block_commit_write+0x82/0xb0
ext4_da_write_begin+0xe5/0x270
generic_perform_write+0x134/0x2b0
ext4_buffered_write_iter+0x57/0xd0
ext4_file_write_iter+0x76/0x7d0
? selinux_file_permission+0x119/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
vfs_write+0x30c/0x440
ksys_write+0x65/0xe0
__x64_sys_write+0x1e/0x30
x64_sys_call+0x11c2/0x1d50
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
memory: usage 308224kB, limit 308224kB, failcnt 2589
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
...
file_dirty 303247360
file_writeback 0
...
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
oom_score_adj:0
The flusher wake up was removed to decrease SSD wearing, but if we are
seeing all dirty folios at the tail of an LRU, not waking up the flusher
could lead to thrashing easily. So wake it up when a mem cgroup is about
to OOM due to dirty caches.
---
Changes from v3:
- Avoid taking lock and reduce overhead on folio isolation by
checking the right flags and rework wake up condition, fixing the
performance regression reported by Chris Li.
[Chris Li, Kairui Song]
- Move the wake up check to try_to_shrink_lruvec to cover kswapd
case as well, and update comments. [Kairui Song]
- Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
Changes from v2:
- Acquire the lock before calling the folio_check_dirty_writeback
function. [Wei Xu, Jingxiang Zeng]
- Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
Changes from v1:
- Add code to count the number of unqueued_dirty in the sort_folio
function. [Wei Xu, Jingxiang Zeng]
- Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
---
Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc7a285b256b..2a5c2fe81467 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
+ bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
+ dirty = folio_test_dirty(folio);
+ writeback = folio_test_writeback(folio);
+ if (type == LRU_GEN_FILE && dirty && !writeback)
+ sc->nr.unqueued_dirty += delta;
+
/* waiting for writeback */
- if (folio_test_locked(folio) || folio_test_writeback(folio) ||
- (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ if (folio_test_locked(folio) || writeback ||
+ (type == LRU_GEN_FILE && dirty)) {
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
@@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-
+ sc->nr.taken += scanned;
/*
* There might not be eligible folios due to reclaim_idx. Check the
* remaining to prevent livelock if it's not making progress.
@@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
+ /*
+ * If too much file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
/* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
--
2.43.5
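A worked reading of the reworked condition: sc->nr.taken now accumulates
everything scanned, and unqueued_dirty can only ever count a subset of
that, so "unqueued_dirty && unqueued_dirty == sc->nr.taken" holds only
when every page scanned so far was a dirty file page with no writeback
queued (say 512 pages scanned, all 512 of them unqueued dirty). In that
case reclaim made no progress at all, which is a much stricter wake-up
trigger than the earlier unqueued_dirty > taken check.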
* [PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
` (3 preceding siblings ...)
2024-09-29 11:26 ` [PATCH] [PATCH v4] " Jingxiang Zeng
@ 2024-09-29 11:30 ` Jingxiang Zeng
2024-10-08 1:56 ` [RESEND][PATCH " Jingxiang Zeng
2024-10-18 5:29 ` [PATCH v5] " Jingxiang Zeng
6 siblings, 0 replies; 24+ messages in thread
From: Jingxiang Zeng @ 2024-09-29 11:30 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc, yuzhao
From: Zeng Jingxiang <linuszeng@tencent.com>
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
removed the opportunity to wake up flushers during the MGLRU page
reclamation process, which can lead to an increased likelihood of
triggering OOM when encountering many dirty pages during reclamation on MGLRU.
This leads to premature OOM if there are too many dirty pages in cgroup:
Killed
dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
order=0, oom_score_adj=0
Call Trace:
<TASK>
dump_stack_lvl+0x5f/0x80
dump_stack+0x14/0x20
dump_header+0x46/0x1b0
oom_kill_process+0x104/0x220
out_of_memory+0x112/0x5a0
mem_cgroup_out_of_memory+0x13b/0x150
try_charge_memcg+0x44f/0x5c0
charge_memcg+0x34/0x50
__mem_cgroup_charge+0x31/0x90
filemap_add_folio+0x4b/0xf0
__filemap_get_folio+0x1a4/0x5b0
? srso_return_thunk+0x5/0x5f
? __block_commit_write+0x82/0xb0
ext4_da_write_begin+0xe5/0x270
generic_perform_write+0x134/0x2b0
ext4_buffered_write_iter+0x57/0xd0
ext4_file_write_iter+0x76/0x7d0
? selinux_file_permission+0x119/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
vfs_write+0x30c/0x440
ksys_write+0x65/0xe0
__x64_sys_write+0x1e/0x30
x64_sys_call+0x11c2/0x1d50
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
memory: usage 308224kB, limit 308224kB, failcnt 2589
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
...
file_dirty 303247360
file_writeback 0
...
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
oom_score_adj:0
The flusher wake-up was removed to decrease SSD wear, but if we are
seeing all dirty folios at the tail of an LRU, not waking up the
flusher can easily lead to thrashing. So wake it up when a memcg is
about to OOM due to dirty caches.
---
Changes from v3:
- Avoid taking lock and reduce overhead on folio isolation by
checking the right flags and rework wake up condition, fixing the
performance regression reported by Chris Li.
[Chris Li, Kairui Song]
- Move the wake up check to try_to_shrink_lruvec to cover kswapd
case as well, and update comments. [Kairui Song]
- Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
Changes from v2:
- Acquire the lock before calling the folio_check_dirty_writeback
function. [Wei Xu, Jingxiang Zeng]
- Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
Changes from v1:
- Add code to count the number of unqueued_dirty in the sort_folio
function. [Wei Xu, Jingxiang Zeng]
- Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
---
Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc7a285b256b..2a5c2fe81467 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
+ bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
+ dirty = folio_test_dirty(folio);
+ writeback = folio_test_writeback(folio);
+ if (type == LRU_GEN_FILE && dirty && !writeback)
+ sc->nr.unqueued_dirty += delta;
+
/* waiting for writeback */
- if (folio_test_locked(folio) || folio_test_writeback(folio) ||
- (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ if (folio_test_locked(folio) || writeback ||
+ (type == LRU_GEN_FILE && dirty)) {
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
@@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-
+ sc->nr.taken += scanned;
/*
* There might not be eligible folios due to reclaim_idx. Check the
* remaining to prevent livelock if it's not making progress.
@@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
+ /*
+ * If too many file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
/* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
--
2.43.5
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] [PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-09-29 11:26 ` [PATCH] [PATCH v4] " Jingxiang Zeng
@ 2024-09-29 11:47 ` jingxiang zeng
0 siblings, 0 replies; 24+ messages in thread
From: jingxiang zeng @ 2024-09-29 11:47 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc, yuzhao
On Sun, 29 Sept 2024 at 19:26, Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
There are some formatting issues with the title of this patch; please
refer to the latest updated patch:
https://lore.kernel.org/all/20240929113050.76079-1-jingxiangzeng.cas@gmail.com/
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake-up was removed to decrease SSD wear, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the
> flusher can easily lead to thrashing. So wake it up when a memcg is
> about to OOM due to dirty caches.
>
> ---
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
> checking the right flags and rework wake up condition, fixing the
> performance regression reported by Chris Li.
> [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> mm/vmscan.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dc7a285b256b..2a5c2fe81467 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> return true;
> }
>
> + dirty = folio_test_dirty(folio);
> + writeback = folio_test_writeback(folio);
> + if (type == LRU_GEN_FILE && dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> +
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_locked(folio) || writeback ||
> + (type == LRU_GEN_FILE && dirty)) {
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> + sc->nr.taken += scanned;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> cond_resched();
> }
>
> + /*
> + * If too many file cache in the coldest generation can't be evicted
> + * due to being dirty, wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> /* whether this lruvec should be rotated */
> return nr_to_scan < 0;
> }
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
` (4 preceding siblings ...)
2024-09-29 11:30 ` Jingxiang Zeng
@ 2024-10-08 1:56 ` Jingxiang Zeng
2024-10-08 2:15 ` Andrew Morton
` (2 more replies)
2024-10-18 5:29 ` [PATCH v5] " Jingxiang Zeng
6 siblings, 3 replies; 24+ messages in thread
From: Jingxiang Zeng @ 2024-10-08 1:56 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc,
yuzhao, chrisl
From: Jingxiang Zeng <linuszeng@tencent.com>
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
removed the opportunity to wake up flushers during the MGLRU page
reclamation process, which can lead to an increased likelihood of
triggering OOM when encountering many dirty pages during reclamation
on MGLRU.
This leads to premature OOM if there are too many dirty pages in the cgroup:
Killed
dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
order=0, oom_score_adj=0
Call Trace:
<TASK>
dump_stack_lvl+0x5f/0x80
dump_stack+0x14/0x20
dump_header+0x46/0x1b0
oom_kill_process+0x104/0x220
out_of_memory+0x112/0x5a0
mem_cgroup_out_of_memory+0x13b/0x150
try_charge_memcg+0x44f/0x5c0
charge_memcg+0x34/0x50
__mem_cgroup_charge+0x31/0x90
filemap_add_folio+0x4b/0xf0
__filemap_get_folio+0x1a4/0x5b0
? srso_return_thunk+0x5/0x5f
? __block_commit_write+0x82/0xb0
ext4_da_write_begin+0xe5/0x270
generic_perform_write+0x134/0x2b0
ext4_buffered_write_iter+0x57/0xd0
ext4_file_write_iter+0x76/0x7d0
? selinux_file_permission+0x119/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
vfs_write+0x30c/0x440
ksys_write+0x65/0xe0
__x64_sys_write+0x1e/0x30
x64_sys_call+0x11c2/0x1d50
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
memory: usage 308224kB, limit 308224kB, failcnt 2589
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
...
file_dirty 303247360
file_writeback 0
...
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
oom_score_adj:0
The flusher wake-up was removed to decrease SSD wear, but if we are
seeing all dirty folios at the tail of an LRU, not waking up the
flusher can easily lead to thrashing. So wake it up when a memcg is
about to OOM due to dirty caches.
---
Changes from v3:
- Avoid taking lock and reduce overhead on folio isolation by
checking the right flags and rework wake up condition, fixing the
performance regression reported by Chris Li.
[Chris Li, Kairui Song]
- Move the wake up check to try_to_shrink_lruvec to cover kswapd
case as well, and update comments. [Kairui Song]
- Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
Changes from v2:
- Acquire the lock before calling the folio_check_dirty_writeback
function. [Wei Xu, Jingxiang Zeng]
- Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
Changes from v1:
- Add code to count the number of unqueued_dirty in the sort_folio
function. [Wei Xu, Jingxiang Zeng]
- Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
---
Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc7a285b256b..2a5c2fe81467 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
+ bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
+ dirty = folio_test_dirty(folio);
+ writeback = folio_test_writeback(folio);
+ if (type == LRU_GEN_FILE && dirty && !writeback)
+ sc->nr.unqueued_dirty += delta;
+
/* waiting for writeback */
- if (folio_test_locked(folio) || folio_test_writeback(folio) ||
- (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ if (folio_test_locked(folio) || writeback ||
+ (type == LRU_GEN_FILE && dirty)) {
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
@@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-
+ sc->nr.taken += scanned;
/*
* There might not be eligible folios due to reclaim_idx. Check the
* remaining to prevent livelock if it's not making progress.
@@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
+ /*
+ * If too many file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
/* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
--
2.43.5
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-08 1:56 ` [RESEND][PATCH " Jingxiang Zeng
@ 2024-10-08 2:15 ` Andrew Morton
2024-10-08 3:26 ` Wei Xu
2024-10-08 17:12 ` Chris Li
2 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2024-10-08 2:15 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, kasong, linuszeng, linux-kernel, tjmercier, weixugc,
yuzhao, chrisl
On Tue, 8 Oct 2024 09:56:35 +0800 Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
> From: Jingxiang Zeng <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
>
> ...
>
> The flusher wake-up was removed to decrease SSD wear, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the
> flusher can easily lead to thrashing. So wake it up when a memcg is
> about to OOM due to dirty caches.
>
Thanks for persisting.
This patch has a somewhat difficult past and not a lot of review.
I'll await feedback from other MGLRU developers before proceeding.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-08 1:56 ` [RESEND][PATCH " Jingxiang Zeng
2024-10-08 2:15 ` Andrew Morton
@ 2024-10-08 3:26 ` Wei Xu
2024-10-09 4:52 ` jingxiang zeng
2024-10-08 17:12 ` Chris Li
2 siblings, 1 reply; 24+ messages in thread
From: Wei Xu @ 2024-10-08 3:26 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
yuzhao, chrisl
On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Jingxiang Zeng <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake-up was removed to decrease SSD wear, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the
> flusher can easily lead to thrashing. So wake it up when a memcg is
> about to OOM due to dirty caches.
>
> ---
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
> checking the right flags and rework wake up condition, fixing the
> performance regression reported by Chris Li.
> [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> mm/vmscan.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dc7a285b256b..2a5c2fe81467 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> return true;
> }
>
> + dirty = folio_test_dirty(folio);
> + writeback = folio_test_writeback(folio);
> + if (type == LRU_GEN_FILE && dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> +
This sounds good. BTW, when shrink_folio_list() in evict_folios()
returns, we should add stat.nr_unqueued_dirty to sc->nr.unqueued_dirty
there as well.
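Something like the sketch below, assuming the local stat that
evict_folios() already passes to shrink_folio_list() is still in scope
at that point (illustrative only, not the exact diff):

	/* In evict_folios(), after the batch has been shrunk: */
	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
	/*
	 * Propagate the dirty folios that shrink_folio_list() saw but
	 * could not queue for writeback, so the flusher wake-up check
	 * in try_to_shrink_lruvec() accounts for them as well.
	 */
	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;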
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_locked(folio) || writeback ||
> + (type == LRU_GEN_FILE && dirty)) {
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> + sc->nr.taken += scanned;
I think we should only include file pages (in sort_folio) and isolated
pages into sc->nr.taken, instead of all scanned pages. For example, if
there are only unevictable and unqueued dirty pages, we would still
like to wake up the flusher threads, but because nr.taken counts
unevictable pages as well, the wakeup condition in
try_to_shrink_lruvec() won't be met.
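Roughly, in scan_folios(), something like this instead (a sketch of
the idea, not tested):

	/*
	 * Only let file folios contribute to sc->nr.taken, so a scan
	 * dominated by unevictable or anon folios cannot prevent
	 * sc->nr.unqueued_dirty == sc->nr.taken from ever holding.
	 */
	if (type == LRU_GEN_FILE)
		sc->nr.taken += scanned;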
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> cond_resched();
> }
>
> + /*
> + * If too many file cache in the coldest generation can't be evicted
> + * due to being dirty, wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
try_to_shrink_lruvec() can be called from shrink_node() for global
reclaim as well. We need to reset sc->nr before calling
lru_gen_shrink_node() there. MGLRU didn't need that because it didn't
use sc->nr until this change.
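For example, something like this in shrink_node() (a sketch, assuming
shrink_node() still enters MGLRU via the root_reclaim() check; the
memset mirrors what the non-MGLRU path already does with sc->nr):

	static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
	{
		...
		if (lru_gen_enabled() && root_reclaim(sc)) {
			/*
			 * MGLRU now consumes sc->nr too, so clear any
			 * stale counts left over from a previous
			 * reclaim cycle before entering it.
			 */
			memset(&sc->nr, 0, sizeof(sc->nr));
			lru_gen_shrink_node(pgdat, sc);
			return;
		}
		...
	}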
> /* whether this lruvec should be rotated */
> return nr_to_scan < 0;
> }
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-08 1:56 ` [RESEND][PATCH " Jingxiang Zeng
2024-10-08 2:15 ` Andrew Morton
2024-10-08 3:26 ` Wei Xu
@ 2024-10-08 17:12 ` Chris Li
2024-10-09 5:29 ` jingxiang zeng
2024-10-09 5:34 ` jingxiang zeng
2 siblings, 2 replies; 24+ messages in thread
From: Chris Li @ 2024-10-08 17:12 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
weixugc, yuzhao
Hi Jingxiang,
I did run the same swap stress test on V4 and it is much better than V3.
The V3 test hung there (timed out); V4 did not hang any more, and it
finishes in about the same time.
If we look closer at V4, it seems to suggest that the V4 system time is
slightly worse. Is that kind of expected, or might it be noise in my
test? Just trying to understand it better; it is not a NACK by any means.
Here is the number on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
Without (10 times):
user 2688.328
system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
6057.21 6063.31 6075.76 6123.18
real 277.145
With V4:
First run (10 times):
user 2688.537
system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
6197.26 6202.98 6204.64 6213.74
real 280.174
Second run (10 times):
user 2771.498
system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
6204.03 6212.9 6216.32 6256.84
real 284.854
Chris
On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Jingxiang Zeng <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake-up was removed to decrease SSD wear, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the
> flusher can easily lead to thrashing. So wake it up when a memcg is
> about to OOM due to dirty caches.
>
> ---
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
> checking the right flags and rework wake up condition, fixing the
> performance regression reported by Chris Li.
> [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> mm/vmscan.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dc7a285b256b..2a5c2fe81467 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> return true;
> }
>
> + dirty = folio_test_dirty(folio);
> + writeback = folio_test_writeback(folio);
> + if (type == LRU_GEN_FILE && dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> +
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_locked(folio) || writeback ||
> + (type == LRU_GEN_FILE && dirty)) {
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> + sc->nr.taken += scanned;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> cond_resched();
> }
>
> + /*
> + * If too many file cache in the coldest generation can't be evicted
> + * due to being dirty, wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> /* whether this lruvec should be rotated */
> return nr_to_scan < 0;
> }
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-08 3:26 ` Wei Xu
@ 2024-10-09 4:52 ` jingxiang zeng
2024-10-09 6:21 ` Wei Xu
0 siblings, 1 reply; 24+ messages in thread
From: jingxiang zeng @ 2024-10-09 4:52 UTC (permalink / raw)
To: Wei Xu
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
yuzhao, chrisl
On Tue, 8 Oct 2024 at 11:26, Wei Xu <weixugc@google.com> wrote:
>
> On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> <jingxiangzeng.cas@gmail.com> wrote:
> >
> > From: Jingxiang Zeng <linuszeng@tencent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x5f/0x80
> > dump_stack+0x14/0x20
> > dump_header+0x46/0x1b0
> > oom_kill_process+0x104/0x220
> > out_of_memory+0x112/0x5a0
> > mem_cgroup_out_of_memory+0x13b/0x150
> > try_charge_memcg+0x44f/0x5c0
> > charge_memcg+0x34/0x50
> > __mem_cgroup_charge+0x31/0x90
> > filemap_add_folio+0x4b/0xf0
> > __filemap_get_folio+0x1a4/0x5b0
> > ? srso_return_thunk+0x5/0x5f
> > ? __block_commit_write+0x82/0xb0
> > ext4_da_write_begin+0xe5/0x270
> > generic_perform_write+0x134/0x2b0
> > ext4_buffered_write_iter+0x57/0xd0
> > ext4_file_write_iter+0x76/0x7d0
> > ? selinux_file_permission+0x119/0x150
> > ? srso_return_thunk+0x5/0x5f
> > ? srso_return_thunk+0x5/0x5f
> > vfs_write+0x30c/0x440
> > ksys_write+0x65/0xe0
> > __x64_sys_write+0x1e/0x30
> > x64_sys_call+0x11c2/0x1d50
> > do_syscall_64+0x47/0x110
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake-up was removed to decrease SSD wear, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the
> > flusher can easily lead to thrashing. So wake it up when a memcg is
> > about to OOM due to dirty caches.
> >
> > ---
> > Changes from v3:
> > - Avoid taking lock and reduce overhead on folio isolation by
> > checking the right flags and rework wake up condition, fixing the
> > performance regression reported by Chris Li.
> > [Chris Li, Kairui Song]
> > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > case as well, and update comments. [Kairui Song]
> > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > Changes from v2:
> > - Acquire the lock before calling the folio_check_dirty_writeback
> > function. [Wei Xu, Jingxiang Zeng]
> > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > Changes from v1:
> > - Add code to count the number of unqueued_dirty in the sort_folio
> > function. [Wei Xu, Jingxiang Zeng]
> > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > ---
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Cc: T.J. Mercier <tjmercier@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > ---
> > mm/vmscan.c | 19 ++++++++++++++++---
> > 1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dc7a285b256b..2a5c2fe81467 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > int tier_idx)
> > {
> > bool success;
> > + bool dirty, writeback;
> > int gen = folio_lru_gen(folio);
> > int type = folio_is_file_lru(folio);
> > int zone = folio_zonenum(folio);
> > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > return true;
> > }
> >
> > + dirty = folio_test_dirty(folio);
> > + writeback = folio_test_writeback(folio);
> > + if (type == LRU_GEN_FILE && dirty && !writeback)
> > + sc->nr.unqueued_dirty += delta;
> > +
>
> This sounds good. BTW, when shrink_folio_list() in evict_folios()
> returns, we should add stat.nr_unqueued_dirty to sc->nr.unqueued_dirty
> there as well.
Thank you for your valuable feedback, I will implement it in the next version.
>
> > /* waiting for writeback */
> > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > + if (folio_test_locked(folio) || writeback ||
> > + (type == LRU_GEN_FILE && dirty)) {
> > gen = folio_inc_gen(lruvec, folio, true);
> > list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > return true;
> > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > scanned, skipped, isolated,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > -
> > + sc->nr.taken += scanned;
>
> I think we should only include file pages (in sort_folio) and isolated
> pages into sc->nr.taken, instead of all scanned pages. For example, if
Unqueued dirty pages are not isolated, but promoted to the newer generation
in the sort_folio function. So I tend to wake up the flusher thread when the
number of scanned pages is equal to the number of unqueued dirty pages.
> there are only unevictable and unqueued dirty pages, we would still
> like to wake up the flusher threads, but because nr.taken counts
> unevictable pages as well, the wakeup condition in
> try_to_shrink_lruvec() won't be met.
The situation you mentioned will not happen, because the number of
scanned pages does not include unevictable pages.
However, there is another situation that needs attention: when the
scanned pages contain both anonymous pages and unqueued dirty pages,
the flusher cannot be woken up. I will fix this in the next version.
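One possible shape for that fix (a hypothetical sketch; file_taken
would be a new field next to the existing counters in sc->nr):

	/* In scan_folios(): account file folios separately. */
	if (type == LRU_GEN_FILE)
		sc->nr.file_taken += scanned;

	/*
	 * In try_to_shrink_lruvec(): compare against file folios only,
	 * so a scan that also covered anon folios can still wake the
	 * flusher when every file folio it saw was unqueued dirty.
	 */
	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);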
>
> > /*
> > * There might not be eligible folios due to reclaim_idx. Check the
> > * remaining to prevent livelock if it's not making progress.
> > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > cond_resched();
> > }
> >
> > + /*
> > + * If too many file cache in the coldest generation can't be evicted
> > + * due to being dirty, wake up the flusher.
> > + */
> > + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
>
> try_to_shrink_lruvec() can be called from shrink_node() for global
> reclaim as well. We need to reset sc->nr before calling
> lru_gen_shrink_node() there. MGLRU didn't need that because it didn't
> use sc->nr until this change.
Thank you for your valuable feedback, I will implement it in the next version.
>
> > /* whether this lruvec should be rotated */
> > return nr_to_scan < 0;
> > }
> > --
> > 2.43.5
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-08 17:12 ` Chris Li
@ 2024-10-09 5:29 ` jingxiang zeng
2024-10-09 5:43 ` jingxiang zeng
2024-10-09 5:34 ` jingxiang zeng
1 sibling, 1 reply; 24+ messages in thread
From: jingxiang zeng @ 2024-10-09 5:29 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
weixugc, yuzhao
On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@kernel.org> wrote:
>
> Hi Jingxiang,
>
> I did run the same swap stress test on V4 and it is much better than V3.
> The V3 test hung there (timed out); V4 did not hang any more, and it
> finishes in about the same time.
>
> If we look closer at V4, it seems to suggest that the V4 system time is
> slightly worse. Is that kind of expected, or might it be noise in my
> test? Just trying to understand it better; it is not a NACK by any means.
>
> Here is the number on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> Without (10 times):
> user 2688.328
> system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> 6057.21 6063.31 6075.76 6123.18
> real 277.145
>
> With V4:
> First run (10 times):
> user 2688.537
> system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> 6197.26 6202.98 6204.64 6213.74
> real 280.174
> Second run (10 times):
> user 2771.498
> system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> 6204.03 6212.9 6216.32 6256.84
> real 284.854
>
> Chris
>
Hi Chris,
Before I released the V4 version, I also ran the swap stress test you
gave me, with -j32 and a 1G memcg on my local branch:
Without the patch:
1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
920100maxresident)k
With the patch:
1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
919880maxresident)k
My test results are the same as yours. This should not be test noise. I am
trying to analyze whether it can be further optimized.
Jingxiang Zeng
> On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> <jingxiangzeng.cas@gmail.com> wrote:
> >
> > From: Jingxiang Zeng <linuszeng@tencent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x5f/0x80
> > dump_stack+0x14/0x20
> > dump_header+0x46/0x1b0
> > oom_kill_process+0x104/0x220
> > out_of_memory+0x112/0x5a0
> > mem_cgroup_out_of_memory+0x13b/0x150
> > try_charge_memcg+0x44f/0x5c0
> > charge_memcg+0x34/0x50
> > __mem_cgroup_charge+0x31/0x90
> > filemap_add_folio+0x4b/0xf0
> > __filemap_get_folio+0x1a4/0x5b0
> > ? srso_return_thunk+0x5/0x5f
> > ? __block_commit_write+0x82/0xb0
> > ext4_da_write_begin+0xe5/0x270
> > generic_perform_write+0x134/0x2b0
> > ext4_buffered_write_iter+0x57/0xd0
> > ext4_file_write_iter+0x76/0x7d0
> > ? selinux_file_permission+0x119/0x150
> > ? srso_return_thunk+0x5/0x5f
> > ? srso_return_thunk+0x5/0x5f
> > vfs_write+0x30c/0x440
> > ksys_write+0x65/0xe0
> > __x64_sys_write+0x1e/0x30
> > x64_sys_call+0x11c2/0x1d50
> > do_syscall_64+0x47/0x110
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake-up was removed to decrease SSD wear, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the
> > flusher can easily lead to thrashing. So wake it up when a memcg is
> > about to OOM due to dirty caches.
> >
> > ---
> > Changes from v3:
> > - Avoid taking lock and reduce overhead on folio isolation by
> > checking the right flags and rework wake up condition, fixing the
> > performance regression reported by Chris Li.
> > [Chris Li, Kairui Song]
> > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > case as well, and update comments. [Kairui Song]
> > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > Changes from v2:
> > - Acquire the lock before calling the folio_check_dirty_writeback
> > function. [Wei Xu, Jingxiang Zeng]
> > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > Changes from v1:
> > - Add code to count the number of unqueued_dirty in the sort_folio
> > function. [Wei Xu, Jingxiang Zeng]
> > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > ---
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Cc: T.J. Mercier <tjmercier@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > ---
> > mm/vmscan.c | 19 ++++++++++++++++---
> > 1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dc7a285b256b..2a5c2fe81467 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > int tier_idx)
> > {
> > bool success;
> > + bool dirty, writeback;
> > int gen = folio_lru_gen(folio);
> > int type = folio_is_file_lru(folio);
> > int zone = folio_zonenum(folio);
> > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > return true;
> > }
> >
> > + dirty = folio_test_dirty(folio);
> > + writeback = folio_test_writeback(folio);
> > + if (type == LRU_GEN_FILE && dirty && !writeback)
> > + sc->nr.unqueued_dirty += delta;
> > +
> > /* waiting for writeback */
> > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > + if (folio_test_locked(folio) || writeback ||
> > + (type == LRU_GEN_FILE && dirty)) {
> > gen = folio_inc_gen(lruvec, folio, true);
> > list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > return true;
> > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > scanned, skipped, isolated,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > -
> > + sc->nr.taken += scanned;
> > /*
> > * There might not be eligible folios due to reclaim_idx. Check the
> > * remaining to prevent livelock if it's not making progress.
> > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > cond_resched();
> > }
> >
> > + /*
> > + * If too many file cache in the coldest generation can't be evicted
> > + * due to being dirty, wake up the flusher.
> > + */
> > + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> > /* whether this lruvec should be rotated */
> > return nr_to_scan < 0;
> > }
> > --
> > 2.43.5
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-08 17:12 ` Chris Li
2024-10-09 5:29 ` jingxiang zeng
@ 2024-10-09 5:34 ` jingxiang zeng
1 sibling, 0 replies; 24+ messages in thread
From: jingxiang zeng @ 2024-10-09 5:34 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
weixugc, yuzhao
Hi Chris,
Before I released the V4 version, I also ran the swap stress test you
gave me, with -j32 and a 1G memcg on my local branch:
With the V4 patch:
1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
920100maxresident)k
Without the patch:
1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
919880maxresident)k
My test results are the same as yours. This should not be test noise. I am
trying to analyze whether it can be further optimized.
Jingxiang Zeng
On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@kernel.org> wrote:
>
> Hi Jingxiang,
>
> I did run the same swap stress test on V4 and it is much better than V3.
> The V3 test hung there (timed out); V4 did not hang any more, and it
> finishes in about the same time.
>
> If we look closer at V4, it seems to suggest that the V4 system time is
> slightly worse. Is that kind of expected, or might it be noise in my
> test? Just trying to understand it better; it is not a NACK by any means.
>
> Here is the number on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> Without (10 times):
> user 2688.328
> system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> 6057.21 6063.31 6075.76 6123.18
> real 277.145
>
> With V4:
> First run (10 times):
> user 2688.537
> system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> 6197.26 6202.98 6204.64 6213.74
> real 280.174
> Second run (10 times):
> user 2771.498
> system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> 6204.03 6212.9 6216.32 6256.84
> real 284.854
>
> Chris
>
> On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> <jingxiangzeng.cas@gmail.com> wrote:
> >
> > From: Jingxiang Zeng <linuszeng@tencent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > Killed
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x5f/0x80
> > dump_stack+0x14/0x20
> > dump_header+0x46/0x1b0
> > oom_kill_process+0x104/0x220
> > out_of_memory+0x112/0x5a0
> > mem_cgroup_out_of_memory+0x13b/0x150
> > try_charge_memcg+0x44f/0x5c0
> > charge_memcg+0x34/0x50
> > __mem_cgroup_charge+0x31/0x90
> > filemap_add_folio+0x4b/0xf0
> > __filemap_get_folio+0x1a4/0x5b0
> > ? srso_return_thunk+0x5/0x5f
> > ? __block_commit_write+0x82/0xb0
> > ext4_da_write_begin+0xe5/0x270
> > generic_perform_write+0x134/0x2b0
> > ext4_buffered_write_iter+0x57/0xd0
> > ext4_file_write_iter+0x76/0x7d0
> > ? selinux_file_permission+0x119/0x150
> > ? srso_return_thunk+0x5/0x5f
> > ? srso_return_thunk+0x5/0x5f
> > vfs_write+0x30c/0x440
> > ksys_write+0x65/0xe0
> > __x64_sys_write+0x1e/0x30
> > x64_sys_call+0x11c2/0x1d50
> > do_syscall_64+0x47/0x110
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake-up was removed to decrease SSD wear, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the
> > flusher can easily lead to thrashing. So wake it up when a memcg is
> > about to OOM due to dirty caches.
> >
> > ---
> > Changes from v3:
> > - Avoid taking lock and reduce overhead on folio isolation by
> > checking the right flags and rework wake up condition, fixing the
> > performance regression reported by Chris Li.
> > [Chris Li, Kairui Song]
> > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > case as well, and update comments. [Kairui Song]
> > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > Changes from v2:
> > - Acquire the lock before calling the folio_check_dirty_writeback
> > function. [Wei Xu, Jingxiang Zeng]
> > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > Changes from v1:
> > - Add code to count the number of unqueued_dirty in the sort_folio
> > function. [Wei Xu, Jingxiang Zeng]
> > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > ---
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Cc: T.J. Mercier <tjmercier@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > ---
> > mm/vmscan.c | 19 ++++++++++++++++---
> > 1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dc7a285b256b..2a5c2fe81467 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > int tier_idx)
> > {
> > bool success;
> > + bool dirty, writeback;
> > int gen = folio_lru_gen(folio);
> > int type = folio_is_file_lru(folio);
> > int zone = folio_zonenum(folio);
> > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > return true;
> > }
> >
> > + dirty = folio_test_dirty(folio);
> > + writeback = folio_test_writeback(folio);
> > + if (type == LRU_GEN_FILE && dirty && !writeback)
> > + sc->nr.unqueued_dirty += delta;
> > +
> > /* waiting for writeback */
> > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > + if (folio_test_locked(folio) || writeback ||
> > + (type == LRU_GEN_FILE && dirty)) {
> > gen = folio_inc_gen(lruvec, folio, true);
> > list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > return true;
> > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > scanned, skipped, isolated,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > -
> > + sc->nr.taken += scanned;
> > /*
> > * There might not be eligible folios due to reclaim_idx. Check the
> > * remaining to prevent livelock if it's not making progress.
> > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > cond_resched();
> > }
> >
> > + /*
> > + * If too many file cache in the coldest generation can't be evicted
> > + * due to being dirty, wake up the flusher.
> > + */
> > + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> > /* whether this lruvec should be rotated */
> > return nr_to_scan < 0;
> > }
> > --
> > 2.43.5
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-09 5:29 ` jingxiang zeng
@ 2024-10-09 5:43 ` jingxiang zeng
0 siblings, 0 replies; 24+ messages in thread
From: jingxiang zeng @ 2024-10-09 5:43 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
weixugc, yuzhao
On Wed, 9 Oct 2024 at 13:29, jingxiang zeng <jingxiangzeng.cas@gmail.com>
wrote:
>
> On Wed, 9 Oct 2024 at 01:12, Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Jingxiang,
> >
> > I did run the same swap stress test on V4 and it is much better than V3.
> > The V3 test hung there (timed out); V4 did not hang any more, and it
> > finishes in about the same time.
> >
> > If we look closer at V4, it seems to suggest that the V4 system time
> > is slightly worse. Is that kind of expected, or might it be noise in
> > my test? Just trying to understand it better; it is not a NACK by any
> > means.
> >
> > Here is the number on mm-unstable c121617e3606be6575cdacfdb63cc8d67b46a568:
> > Without (10 times):
> > user 2688.328
> > system 6059.021 : 6031.57 6043.61 6044.35 6045.01 6052.46 6053.75
> > 6057.21 6063.31 6075.76 6123.18
> > real 277.145
> >
> > With V4:
> > First run (10 times):
> > user 2688.537
> > system 6180.907 : 6128.4 6145.47 6160.25 6167.09 6193.31 6195.93
> > 6197.26 6202.98 6204.64 6213.74
> > real 280.174
> > Second run (10 times):
> > user 2771.498
> > system 6199.043 : 6165.39 6173.49 6179.97 6189.03 6193.13 6199.33
> > 6204.03 6212.9 6216.32 6256.84
> > real 284.854
> >
> > Chris
> >
>
> Hi Chris,
>
> Before I released the V4 version, I also ran the swap stress test you
> gave me, with -j32 and a 1G memcg on my local branch:
>
> Without the patch:
> 1952.07user 1768.35system 4:51.89elapsed 1274%CPU (0avgtext+0avgdata
> 920100maxresident)k
>
> With the patch:
> 1957.83user 1757.06system 4:51.15elapsed 1275%CPU (0avgtext+0avgdata
> 919880maxresident)k
The result here is written in reverse; please refer to the other email:
https://lore.kernel.org/all/CAJqJ8ijzAhVxuE16-oawhhs3YmHKWmmoo0ca5KyaLGME7aoXjw@mail.gmail.com/
>
> My test results are the same as yours. This should not be test noise. I am
> trying to analyze whether it can be further optimized.
>
> Jingxiang Zeng
>
> > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> > <jingxiangzeng.cas@gmail.com> wrote:
> > >
> > > From: Jingxiang Zeng <linuszeng@tencent.com>
> > >
> > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > removed the opportunity to wake up flushers during the MGLRU page
> > > reclamation process, which can lead to an increased likelihood of
> > > triggering OOM when encountering many dirty pages during reclamation
> > > on MGLRU.
> > >
> > > This leads to premature OOM if there are too many dirty pages in the
> > > cgroup:
> > > Killed
> > >
> > > dd invoked oom-killer:
gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > > order=0, oom_score_adj=0
> > >
> > > Call Trace:
> > > <TASK>
> > > dump_stack_lvl+0x5f/0x80
> > > dump_stack+0x14/0x20
> > > dump_header+0x46/0x1b0
> > > oom_kill_process+0x104/0x220
> > > out_of_memory+0x112/0x5a0
> > > mem_cgroup_out_of_memory+0x13b/0x150
> > > try_charge_memcg+0x44f/0x5c0
> > > charge_memcg+0x34/0x50
> > > __mem_cgroup_charge+0x31/0x90
> > > filemap_add_folio+0x4b/0xf0
> > > __filemap_get_folio+0x1a4/0x5b0
> > > ? srso_return_thunk+0x5/0x5f
> > > ? __block_commit_write+0x82/0xb0
> > > ext4_da_write_begin+0xe5/0x270
> > > generic_perform_write+0x134/0x2b0
> > > ext4_buffered_write_iter+0x57/0xd0
> > > ext4_file_write_iter+0x76/0x7d0
> > > ? selinux_file_permission+0x119/0x150
> > > ? srso_return_thunk+0x5/0x5f
> > > ? srso_return_thunk+0x5/0x5f
> > > vfs_write+0x30c/0x440
> > > ksys_write+0x65/0xe0
> > > __x64_sys_write+0x1e/0x30
> > > x64_sys_call+0x11c2/0x1d50
> > > do_syscall_64+0x47/0x110
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > >
> > > ...
> > > file_dirty 303247360
> > > file_writeback 0
> > > ...
> > >
> > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > > Memory cgroup out of memory: Killed process 4404 (dd)
total-vm:10512kB,
> > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > > oom_score_adj:0
> > >
> > > The flusher wake up was removed to decrease SSD wearing, but if we are
> > > seeing all dirty folios at the tail of an LRU, not waking up the
flusher
> > > could lead to thrashing easily. So wake it up when a mem cgroups is
about
> > > to OOM due to dirty caches.
> > >
> > > ---
> > > Changes from v3:
> > > - Avoid taking lock and reduce overhead on folio isolation by
> > > checking the right flags and rework wake up condition, fixing the
> > > performance regression reported by Chris Li.
> > > [Chris Li, Kairui Song]
> > > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > > case as well, and update comments. [Kairui Song]
> > > - Link to v3:
https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v2:
> > > - Acquire the lock before calling the folio_check_dirty_writeback
> > > function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v2:
https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v1:
> > > - Add code to count the number of unqueued_dirty in the sort_folio
> > > function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v1:
https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > > ---
> > >
> > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > Cc: T.J. Mercier <tjmercier@google.com>
> > > Cc: Wei Xu <weixugc@google.com>
> > > Cc: Yu Zhao <yuzhao@google.com>
> > > ---
> > > mm/vmscan.c | 19 ++++++++++++++++---
> > > 1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index dc7a285b256b..2a5c2fe81467 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec,
struct folio *folio, struct scan_c
> > > int tier_idx)
> > > {
> > > bool success;
> > > + bool dirty, writeback;
> > > int gen = folio_lru_gen(folio);
> > > int type = folio_is_file_lru(folio);
> > > int zone = folio_zonenum(folio);
> > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec,
struct folio *folio, struct scan_c
> > > return true;
> > > }
> > >
> > > + dirty = folio_test_dirty(folio);
> > > + writeback = folio_test_writeback(folio);
> > > + if (type == LRU_GEN_FILE && dirty && !writeback)
> > > + sc->nr.unqueued_dirty += delta;
> > > +
> > > /* waiting for writeback */
> > > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > + if (folio_test_locked(folio) || writeback ||
> > > + (type == LRU_GEN_FILE && dirty)) {
> > > gen = folio_inc_gen(lruvec, folio, true);
> > > list_move(&folio->lru,
&lrugen->folios[gen][type][zone]);
> > > return true;
> > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec,
struct scan_control *sc,
> > > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order,
MAX_LRU_BATCH,
> > > scanned, skipped, isolated,
> > > type ? LRU_INACTIVE_FILE :
LRU_INACTIVE_ANON);
> > > -
> > > + sc->nr.taken += scanned;
> > > /*
> > > * There might not be eligible folios due to reclaim_idx.
Check the
> > > * remaining to prevent livelock if it's not making progress.
> > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec
*lruvec, struct scan_control *sc)
> > > cond_resched();
> > > }
> > >
> > > + /*
> > > + * If too many file cache in the coldest generation can't be
evicted
> > > + * due to being dirty, wake up the flusher.
> > > + */
> > > + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty ==
sc->nr.taken)
> > > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +
> > > /* whether this lruvec should be rotated */
> > > return nr_to_scan < 0;
> > > }
> > > --
> > > 2.43.5
> > >
* Re: [RESEND][PATCH v4] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-09 4:52 ` jingxiang zeng
@ 2024-10-09 6:21 ` Wei Xu
0 siblings, 0 replies; 24+ messages in thread
From: Wei Xu @ 2024-10-09 6:21 UTC (permalink / raw)
To: jingxiang zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
yuzhao, chrisl
On Tue, Oct 8, 2024 at 9:52 PM jingxiang zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> On Tue, 8 Oct 2024 at 11:26, Wei Xu <weixugc@google.com> wrote:
> >
> > On Mon, Oct 7, 2024 at 6:57 PM Jingxiang Zeng
> > <jingxiangzeng.cas@gmail.com> wrote:
> > >
> > > From: Jingxiang Zeng <linuszeng@tencent.com>
> > >
> > > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > removed the opportunity to wake up flushers during the MGLRU page
> > > reclamation process. This can lead to an increased likelihood of
> > > triggering OOM when encountering many dirty pages during reclamation
> > > on MGLRU.
> > >
> > > This leads to premature OOM if there are too many dirty pages in the cgroup:
> > > Killed
> > >
> > > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > > order=0, oom_score_adj=0
> > >
> > > Call Trace:
> > > <TASK>
> > > dump_stack_lvl+0x5f/0x80
> > > dump_stack+0x14/0x20
> > > dump_header+0x46/0x1b0
> > > oom_kill_process+0x104/0x220
> > > out_of_memory+0x112/0x5a0
> > > mem_cgroup_out_of_memory+0x13b/0x150
> > > try_charge_memcg+0x44f/0x5c0
> > > charge_memcg+0x34/0x50
> > > __mem_cgroup_charge+0x31/0x90
> > > filemap_add_folio+0x4b/0xf0
> > > __filemap_get_folio+0x1a4/0x5b0
> > > ? srso_return_thunk+0x5/0x5f
> > > ? __block_commit_write+0x82/0xb0
> > > ext4_da_write_begin+0xe5/0x270
> > > generic_perform_write+0x134/0x2b0
> > > ext4_buffered_write_iter+0x57/0xd0
> > > ext4_file_write_iter+0x76/0x7d0
> > > ? selinux_file_permission+0x119/0x150
> > > ? srso_return_thunk+0x5/0x5f
> > > ? srso_return_thunk+0x5/0x5f
> > > vfs_write+0x30c/0x440
> > > ksys_write+0x65/0xe0
> > > __x64_sys_write+0x1e/0x30
> > > x64_sys_call+0x11c2/0x1d50
> > > do_syscall_64+0x47/0x110
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > >
> > > ...
> > > file_dirty 303247360
> > > file_writeback 0
> > > ...
> > >
> > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > > oom_score_adj:0
> > >
> > > The flusher wake up was removed to decrease SSD wearing, but if we are
> > > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > > could lead to thrashing easily. So wake it up when a mem cgroup is about
> > > to OOM due to dirty caches.
> > >
> > > ---
> > > Changes from v3:
> > > - Avoid taking lock and reduce overhead on folio isolation by
> > > checking the right flags and rework wake up condition, fixing the
> > > performance regression reported by Chris Li.
> > > [Chris Li, Kairui Song]
> > > - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> > > case as well, and update comments. [Kairui Song]
> > > - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v2:
> > > - Acquire the lock before calling the folio_check_dirty_writeback
> > > function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> > > Changes from v1:
> > > - Add code to count the number of unqueued_dirty in the sort_folio
> > > function. [Wei Xu, Jingxiang Zeng]
> > > - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> > > ---
> > >
> > > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > Cc: T.J. Mercier <tjmercier@google.com>
> > > Cc: Wei Xu <weixugc@google.com>
> > > Cc: Yu Zhao <yuzhao@google.com>
> > > ---
> > > mm/vmscan.c | 19 ++++++++++++++++---
> > > 1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index dc7a285b256b..2a5c2fe81467 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4291,6 +4291,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > int tier_idx)
> > > {
> > > bool success;
> > > + bool dirty, writeback;
> > > int gen = folio_lru_gen(folio);
> > > int type = folio_is_file_lru(folio);
> > > int zone = folio_zonenum(folio);
> > > @@ -4336,9 +4337,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > return true;
> > > }
> > >
> > > + dirty = folio_test_dirty(folio);
> > > + writeback = folio_test_writeback(folio);
> > > + if (type == LRU_GEN_FILE && dirty && !writeback)
> > > + sc->nr.unqueued_dirty += delta;
> > > +
> >
> > This sounds good. BTW, when shrink_folio_list() in evict_folios()
> > returns, we should add stat.nr_unqueued_dirty to sc->nr.unqueued_dirty
> > there as well.
>
> Thank you for your valuable feedback, I will implement it in the next version.
> >
> > > /* waiting for writeback */
> > > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > + if (folio_test_locked(folio) || writeback ||
> > > + (type == LRU_GEN_FILE && dirty)) {
> > > gen = folio_inc_gen(lruvec, folio, true);
> > > list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > return true;
> > > @@ -4454,7 +4460,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> > > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> > > scanned, skipped, isolated,
> > > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > > -
> > > + sc->nr.taken += scanned;
> >
> > I think we should only include file pages (in sort_folio) and isolated
> > pages in sc->nr.taken, instead of all scanned pages. For example, if
>
> Unqueued dirty pages are not isolated, but are promoted to the newer generation
> in the sort_folio function. So I tend to wake up the flusher thread when the
> number of scanned pages is equal to the number of unqueued dirty pages.
>
> > there are only unevictable and unqueued dirty pages, we would still
> > like to wake up the flusher threads, but because nr.taken counts
> > unevictable pages as well, the wakeup condition in
> > try_to_shrink_lruvec() won't be met.
>
> The situation you mentioned will not happen because the number of
> scanned pages does not include unevictable pages.
> However, there is another situation that needs attention. When the
> scanned pages include both anonymous pages and unqueued dirty file pages,
> the flusher cannot be woken up. I will fix this in the next version.
>
Pages in unevictable LRUs are not scanned, but unevictable pages that
are not yet moved to unevictable LRUs can be included in scanned
pages. See the !folio_evictable(folio) check in sort_folio().
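For illustration, that path looks roughly like this (a simplified
sketch of the check in sort_folio(), not the exact upstream code;
delta is the number of base pages the folio contributes to the
scanned count):

	/* unevictable folios are culled here, yet still count as scanned */
	if (!folio_evictable(folio)) {
		success = lru_gen_del_folio(lruvec, folio, true);
		VM_WARN_ON_ONCE_FOLIO(!success, folio);
		folio_set_unevictable(folio);
		lruvec_add_folio(lruvec, folio);
		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
		return true;	/* sorted, so the caller still adds delta to scanned */
	}

So a batch made up of unevictable and unqueued dirty folios still
reports a non-zero scanned count, which is why counting all scanned
pages into sc->nr.taken can mask the dirty-only case.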
> >
> > > /*
> > > * There might not be eligible folios due to reclaim_idx. Check the
> > > * remaining to prevent livelock if it's not making progress.
> > > @@ -4796,6 +4802,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > cond_resched();
> > > }
> > >
> > > + /*
> > > + * If too much file cache in the coldest generation can't be evicted
> > > + * due to being dirty, wake up the flusher.
> > > + */
> > > + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.taken)
> > > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +
> >
> > try_to_shrink_lruvec() can be called from shrink_node() for global
> > reclaim as well. We need to reset sc->nr before calling
> > lru_gen_shrink_node() there. MGLRU didn't need that because it didn't
> > use sc->nr until this change.
>
> Thank you for your valuable feedback, I will implement it in the next version.
> >
> > > /* whether this lruvec should be rotated */
> > > return nr_to_scan < 0;
> > > }
> > > --
> > > 2.43.5
> > >
* Re: [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter
2024-09-13 8:44 ` [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter Jingxiang Zeng
2024-09-16 7:41 ` Andrew Morton
@ 2024-10-11 20:01 ` Nhat Pham
1 sibling, 0 replies; 24+ messages in thread
From: Nhat Pham @ 2024-10-11 20:01 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, cgroups, hannes, linuszeng, linux-kernel, mhocko,
muchun.song, roman.gushchin, shakeel.butt, yosryahmed
On Fri, Sep 13, 2024 at 8:21 AM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Jingxiang Zeng <linuszeng@tencent.com>
>
> In proactive memory reclamation scenarios, it is necessary to estimate the
> pswpin and pswpout metrics of the cgroup to determine whether to continue
> reclaiming anonymous pages in the current batch. This patch will collect
> these metrics and expose them.
+1 - this is also useful for zswap shrinker enablement, after which an
anon page can be loaded back in either from swap or zswap.
Differentiating these two situations helps a lot with performance
regression diagnostics.
We have host level metrics, but they become less useful when we
combine workloads with different characteristics in the same host.
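As a usage sketch, the new counters should then be readable from the
cgroup's memory.stat file; something like the snippet below (the
/sys/fs/cgroup mount point and the "test" cgroup name are assumptions
for illustration, not part of this patch):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/sys/fs/cgroup/test/memory.stat", "r");

		if (!f)
			return 1;
		/* print only the per-memcg swap-in/out event counters */
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "pswpin ", 7) ||
			    !strncmp(line, "pswpout ", 8))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}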
>
> Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com
> Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/memcontrol.c | 2 ++
> mm/page_io.c | 4 ++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6efbfc9399d0..dbc1d43a5c4c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -418,6 +418,8 @@ static const unsigned int memcg_vm_event_stat[] = {
> PGPGIN,
> PGPGOUT,
> #endif
> + PSWPIN,
> + PSWPOUT,
> PGSCAN_KSWAPD,
> PGSCAN_DIRECT,
> PGSCAN_KHUGEPAGED,
> diff --git a/mm/page_io.c b/mm/page_io.c
> index b6f1519d63b0..4bc77d1c6bfa 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -310,6 +310,7 @@ static inline void count_swpout_vm_event(struct folio *folio)
> }
> count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
> #endif
> + count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio));
> count_vm_events(PSWPOUT, folio_nr_pages(folio));
> }
>
> @@ -505,6 +506,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
> for (p = 0; p < sio->pages; p++) {
> struct folio *folio = page_folio(sio->bvec[p].bv_page);
>
> + count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
> folio_mark_uptodate(folio);
> folio_unlock(folio);
> }
> @@ -588,6 +590,7 @@ static void swap_read_folio_bdev_sync(struct folio *folio,
> * attempt to access it in the page fault retry time check.
> */
> get_task_struct(current);
> + count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
> count_vm_event(PSWPIN);
> submit_bio_wait(&bio);
> __end_swap_bio_read(&bio);
> @@ -603,6 +606,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,
> bio->bi_iter.bi_sector = swap_folio_sector(folio);
> bio->bi_end_io = end_swap_bio_read;
> bio_add_folio_nofail(bio, folio, folio_size(folio), 0);
> + count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
> count_vm_event(PSWPIN);
Not related to this patch, but why do the global stats not take large
folios into account here... `count_vm_event(PSWPIN);`?
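(For comparison, a folio-aware count on the read side would presumably
mirror the swap-out path quoted above, e.g. a sketch:

	count_vm_events(PSWPIN, folio_nr_pages(folio));

but that would be a separate cleanup, not something this patch changes.)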
Acked-by: Nhat Pham <nphamcs@gmail.com>
* [PATCH v5] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
` (5 preceding siblings ...)
2024-10-08 1:56 ` [RESEND][PATCH " Jingxiang Zeng
@ 2024-10-18 5:29 ` Jingxiang Zeng
2024-10-18 6:36 ` Wei Xu
6 siblings, 1 reply; 24+ messages in thread
From: Jingxiang Zeng @ 2024-10-18 5:29 UTC (permalink / raw)
To: linux-mm
Cc: akpm, kasong, linuszeng, linux-kernel, tjmercier, weixugc,
yuzhao, chrisl
From: Zeng Jingxiang <linuszeng@tencent.com>
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
removed the opportunity to wake up flushers during the MGLRU page
reclamation process. This can lead to an increased likelihood of
triggering OOM when encountering many dirty pages during reclamation
on MGLRU.
This leads to premature OOM if there are too many dirty pages in the cgroup:
Killed
dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
order=0, oom_score_adj=0
Call Trace:
<TASK>
dump_stack_lvl+0x5f/0x80
dump_stack+0x14/0x20
dump_header+0x46/0x1b0
oom_kill_process+0x104/0x220
out_of_memory+0x112/0x5a0
mem_cgroup_out_of_memory+0x13b/0x150
try_charge_memcg+0x44f/0x5c0
charge_memcg+0x34/0x50
__mem_cgroup_charge+0x31/0x90
filemap_add_folio+0x4b/0xf0
__filemap_get_folio+0x1a4/0x5b0
? srso_return_thunk+0x5/0x5f
? __block_commit_write+0x82/0xb0
ext4_da_write_begin+0xe5/0x270
generic_perform_write+0x134/0x2b0
ext4_buffered_write_iter+0x57/0xd0
ext4_file_write_iter+0x76/0x7d0
? selinux_file_permission+0x119/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
vfs_write+0x30c/0x440
ksys_write+0x65/0xe0
__x64_sys_write+0x1e/0x30
x64_sys_call+0x11c2/0x1d50
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
memory: usage 308224kB, limit 308224kB, failcnt 2589
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
...
file_dirty 303247360
file_writeback 0
...
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
oom_score_adj:0
The flusher wake up was removed to decrease SSD wearing, but if we are
seeing all dirty folios at the tail of an LRU, not waking up the flusher
could lead to thrashing easily. So wake it up when a mem cgroup is about
to OOM due to dirty caches.
I did run the kernel build test[1] on V5, with -j16 and a 1G memcg, on my local
branch:
Without the patch(10 times):
user 1473.29
system 347.87 339.17 345.28 354.64 352.46 355.63 358.80 359.40 358.28
350.95 (avg 352.248)
real 166.651
With the V5 patch(10 times):
user 1470.7
system 339.13 350.58 350.07 355.58 348.96 344.83 351.78 336.39 350.45
343.31 (avg 347.108)
real 165.821
Test results show about a 1% performance improvement with this patch,
which is likely just noise.
---
Changes from v4:
- Add the number of unqueued dirty pages in the shrink_folio_list function
to sc->nr.unqueued_dirty. [Wei Xu, Jingxiang Zeng]
- Reset sc->nr before calling lru_gen_shrink_node function.
[Wei Xu, Jingxiang Zeng]
- Modified the conditions for waking up the flusher thread to avoid
interference from unevictable and anonymous pages.
[Wei Xu, Jingxiang Zeng]
- Link to v4: https://lore.kernel.org/all/20240929113050.76079-1-jingxiangzeng.cas@gmail.com/
Changes from v3:
- Avoid taking lock and reduce overhead on folio isolation by
checking the right flags and rework wake up condition, fixing the
performance regression reported by Chris Li.
[Chris Li, Kairui Song]
- Move the wake up check to try_to_shrink_lruvec to cover kswapd
case as well, and update comments. [Kairui Song]
- Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
Changes from v2:
- Acquire the lock before calling the folio_check_dirty_writeback
function. [Wei Xu, Jingxiang Zeng]
- Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
Changes from v1:
- Add code to count the number of unqueued_dirty in the sort_folio
function. [Wei Xu, Jingxiang Zeng]
- Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
---
Link: https://lore.kernel.org/all/CACePvbV4L-gRN9UKKuUnksfVJjOTq_5Sti2-e=pb_w51kucLKQ@mail.gmail.com/ [1]
Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d0486189804..97e0af338ee0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4292,6 +4292,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
+ bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4337,9 +4338,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
+ dirty = folio_test_dirty(folio);
+ writeback = folio_test_writeback(folio);
+ if (type == LRU_GEN_FILE && dirty && !writeback)
+ sc->nr.unqueued_dirty += delta;
+
/* waiting for writeback */
- if (folio_test_locked(folio) || folio_test_writeback(folio) ||
- (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ if (folio_test_locked(folio) || writeback ||
+ (type == LRU_GEN_FILE && dirty)) {
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
@@ -4455,7 +4461,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-
+ if (type == LRU_GEN_FILE)
+ sc->nr.taken += isolated;
/*
* There might not be eligible folios due to reclaim_idx. Check the
* remaining to prevent livelock if it's not making progress.
@@ -4589,6 +4596,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
return scanned;
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
+ sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
scanned, reclaimed, &stat, sc->priority,
@@ -4797,6 +4805,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
+ /*
+ * If too much file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (sc->nr.unqueued_dirty && !sc->nr.taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
/* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
@@ -5942,6 +5957,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
bool reclaimable = false;
if (lru_gen_enabled() && root_reclaim(sc)) {
+ memset(&sc->nr, 0, sizeof(sc->nr));
lru_gen_shrink_node(pgdat, sc);
return;
}
--
2.43.5
* Re: [PATCH v5] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
2024-10-18 5:29 ` [PATCH v5] " Jingxiang Zeng
@ 2024-10-18 6:36 ` Wei Xu
0 siblings, 0 replies; 24+ messages in thread
From: Wei Xu @ 2024-10-18 6:36 UTC (permalink / raw)
To: Jingxiang Zeng
Cc: linux-mm, akpm, kasong, linuszeng, linux-kernel, tjmercier,
yuzhao, chrisl
On Thu, Oct 17, 2024 at 10:29 PM Jingxiang Zeng
<jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process. This can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is about
> to OOM due to dirty caches.
>
> I did run the kernel build test[1] on V5, with -j16 and a 1G memcg, on my local
> branch:
>
> Without the patch(10 times):
> user 1473.29
> system 347.87 339.17 345.28 354.64 352.46 355.63 358.80 359.40 358.28
> 350.95 (avg 352.248)
> real 166.651
>
> With the V5 patch(10 times):
> user 1470.7
> system 339.13 350.58 350.07 355.58 348.96 344.83 351.78 336.39 350.45
> 343.31 (avg 347.108)
> real 165.821
>
> Test results show about a 1% performance improvement with this patch,
> which is likely just noise.
>
> ---
> Changes from v4:
> - Add the number of unqueued dirty pages in the shrink_folio_list function
> to sc->nr.unqueued_dirty. [Wei Xu, Jingxiang Zeng]
> - Reset sc->nr before calling lru_gen_shrink_node function.
> [Wei Xu, Jingxiang Zeng]
> - Modified the conditions for waking up the flusher thread to avoid
> interference from unevictable and anonymous pages.
> [Wei Xu, Jingxiang Zeng]
> - Link to v4: https://lore.kernel.org/all/20240929113050.76079-1-jingxiangzeng.cas@gmail.com/
> Changes from v3:
> - Avoid taking lock and reduce overhead on folio isolation by
> checking the right flags and rework wake up condition, fixing the
> performance regression reported by Chris Li.
> [Chris Li, Kairui Song]
> - Move the wake up check to try_to_shrink_lruvec to cover kswapd
> case as well, and update comments. [Kairui Song]
> - Link to v3: https://lore.kernel.org/all/20240924121358.30685-1-jingxiangzeng.cas@gmail.com/
> Changes from v2:
> - Acquire the lock before calling the folio_check_dirty_writeback
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v2: https://lore.kernel.org/all/20240913084506.3606292-1-jingxiangzeng.cas@gmail.com/
> Changes from v1:
> - Add code to count the number of unqueued_dirty in the sort_folio
> function. [Wei Xu, Jingxiang Zeng]
> - Link to v1: https://lore.kernel.org/all/20240829102543.189453-1-jingxiangzeng.cas@gmail.com/
> ---
>
> Link: https://lore.kernel.org/all/CACePvbV4L-gRN9UKKuUnksfVJjOTq_5Sti2-e=pb_w51kucLKQ@mail.gmail.com/ [1]
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> mm/vmscan.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d0486189804..97e0af338ee0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4292,6 +4292,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4337,9 +4338,14 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> return true;
> }
>
> + dirty = folio_test_dirty(folio);
> + writeback = folio_test_writeback(folio);
> + if (type == LRU_GEN_FILE && dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> +
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_locked(folio) || writeback ||
> + (type == LRU_GEN_FILE && dirty)) {
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4455,7 +4461,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -
> + if (type == LRU_GEN_FILE)
> + sc->nr.taken += isolated;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4589,6 +4596,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> return scanned;
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> sc->nr_reclaimed += reclaimed;
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> scanned, reclaimed, &stat, sc->priority,
> @@ -4797,6 +4805,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> cond_resched();
> }
>
> + /*
> + * If too much file cache in the coldest generation can't be evicted
> + * due to being dirty, wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty && !sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
This wakeup condition is too relaxed: we can wake up the flusher even
when there is only one unqueued dirty page and many clean file pages.
I'd suggest that we add dirty file pages to sc->nr.taken in
sort_folio() as well, in addition to isolated file pages in
scan_folios(). Then we can change the wakeup condition to be:
if (sc->nr.taken && sc->nr.unqueued_dirty == sc->nr.taken)
Also, given that only file pages are counted here, it would be better
to add sc->nr.file_taken as the new counter and use it in this patch
for more clarity.
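Roughly, something like this (an untested sketch of the suggestion;
sc->nr.file_taken is a hypothetical new field next to the existing
counters in scan_control's nr sub-struct):

	/* in sort_folio(): dirty file folios kept on the LRU count as taken */
	if (type == LRU_GEN_FILE && dirty)
		sc->nr.file_taken += delta;

	/* in scan_folios(): isolated file folios count as taken */
	if (type == LRU_GEN_FILE)
		sc->nr.file_taken += isolated;

	/* in try_to_shrink_lruvec(): wake the flusher only when everything
	 * taken from the file LRU was unqueued dirty
	 */
	if (sc->nr.file_taken && sc->nr.unqueued_dirty == sc->nr.file_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);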
> +
> /* whether this lruvec should be rotated */
> return nr_to_scan < 0;
> }
> @@ -5942,6 +5957,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> bool reclaimable = false;
>
> if (lru_gen_enabled() && root_reclaim(sc)) {
> + memset(&sc->nr, 0, sizeof(sc->nr));
> lru_gen_shrink_node(pgdat, sc);
> return;
> }
> --
> 2.43.5
>
Thread overview: 24+ messages (newest: 2024-10-18 6:36 UTC)
[not found] <Jingxiang Zeng <linuszeng@tencent.com>
2024-09-13 8:44 ` [PATCH V2] mm/memcontrol: add per-memcg pgpgin/pswpin counter Jingxiang Zeng
2024-09-16 7:41 ` Andrew Morton
2024-09-21 7:51 ` jingxiang zeng
2024-10-11 20:01 ` Nhat Pham
2024-09-13 8:45 ` [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM Jingxiang Zeng
2024-09-16 16:11 ` Wei Xu
2024-09-23 1:44 ` jingxiang zeng
2024-09-25 0:41 ` Chris Li
2024-09-24 12:13 ` [PATCH V3] " Jingxiang Zeng
2024-09-25 2:07 ` Chris Li
2024-09-29 11:26 ` [PATCH] [PATCH v4] " Jingxiang Zeng
2024-09-29 11:47 ` jingxiang zeng
2024-09-29 11:30 ` Jingxiang Zeng
2024-10-08 1:56 ` [RESEND][PATCH " Jingxiang Zeng
2024-10-08 2:15 ` Andrew Morton
2024-10-08 3:26 ` Wei Xu
2024-10-09 4:52 ` jingxiang zeng
2024-10-09 6:21 ` Wei Xu
2024-10-08 17:12 ` Chris Li
2024-10-09 5:29 ` jingxiang zeng
2024-10-09 5:43 ` jingxiang zeng
2024-10-09 5:34 ` jingxiang zeng
2024-10-18 5:29 ` [PATCH v5] " Jingxiang Zeng
2024-10-18 6:36 ` Wei Xu