* [PATCH mm-unstable v1 1/2] mm/mglru: fix div-by-zero in vmpressure_calc_level()
From: Yu Zhao @ 2024-07-11 19:19 UTC
To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao, Wei Xu, stable
evict_folios() makes a second pass to reclaim folios that went
through page writeback and became clean before the first pass
finished, since folio_rotate_reclaimable() cannot handle those folios
while they are isolated.
The second pass tries to avoid potential double counting by deducting
scan_control->nr_scanned. However, this deduction can underflow
nr_scanned when shrink_folio_list() did not increment it in the first
place, i.e., when folio_trylock() failed.
The underflow can cause the divisor in vmpressure_calc_level(), i.e.,
scale = scanned + reclaimed, to become zero, resulting in the
following crash:
[exception RIP: vmpressure_work_fn+101]
process_one_work at ffffffffa3313f2b
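For illustration, here is a minimal user-space sketch of the
arithmetic, with invented values (it is not the kernel code itself):
the unsigned deduction wraps nr_scanned around, and the wrapped value
can in turn make the scanned + reclaimed sum wrap to zero.

  #include <stdio.h>

  int main(void)
  {
          /* folio_trylock() failed, so shrink_folio_list() never
           * incremented nr_scanned for this folio */
          unsigned long nr_scanned = 0;

          /* the second-pass deduction wraps around to ULONG_MAX */
          nr_scanned -= 1;

          /* a wrapped operand can then zero the divisor:
           * ULONG_MAX + 1 wraps back to 0 */
          unsigned long reclaimed = 1;
          unsigned long scale = nr_scanned + reclaimed;

          printf("nr_scanned = %lu, scale = %lu\n", nr_scanned, scale);
          return 0;
  }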
Since scan_control->nr_scanned has no established semantics, the
potential double counting carries minimal risk. Therefore, fix the
problem by not deducting scan_control->nr_scanned in evict_folios().
Reported-by: Wei Xu <weixugc@google.com>
Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0761f91b407f..6403038c776e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4597,7 +4597,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
/* retry folios that may have missed folio_rotate_reclaimable() */
list_move(&folio->lru, &clean);
- sc->nr_scanned -= folio_nr_pages(folio);
}
spin_lock_irq(&lruvec->lru_lock);
--
2.45.2.993.g49e7a77208-goog
* [PATCH mm-unstable v1 2/2] mm/mglru: fix overshooting shrinker memory
From: Yu Zhao @ 2024-07-11 19:19 UTC
To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao, Alexander Motin, stable
set_initial_priority() tries to jump-start global reclaim by
estimating the priority based on cold/hot LRU pages. The estimation
does not account for shrinker objects, and it cannot do so because
their sizes can be in units other than pages.
If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0
where the ZFS ARC can use almost all system memory,
set_initial_priority() can vastly underestimate how much memory the
ARC shrinker can evict and assign extremely low values to
scan_control->priority, resulting in overshoots of shrinker objects.
To reproduce the problem, use TrueNAS SCALE 24.04.0 with 32GB of
DRAM, a test ZFS pool, and the following commands:
fio --name=mglru.file --numjobs=36 --ioengine=io_uring \
--directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \
--rw=randread --random_distribution=random \
--time_based --runtime=1h &
for ((i = 0; i < 20; i++))
do
sleep 120
fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
--filename=/dev/zero --size=1024m --fadvise_hint=0 \
--rw=randrw --random_distribution=random \
--time_based --runtime=1m
done
To fix the problem:
1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
the jump-start from being overly aggressive (a worked example of the
estimation follows the list below).
2. Account for the progress from mm_account_reclaimed_pages(), to
prevent kswapd_shrink_node() from raising the priority
unnecessarily.
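For intuition on item 1, here is a minimal user-space sketch of the
estimation in set_initial_priority(), using invented numbers and a
local stand-in for the kernel's fls_long():

  #include <stdio.h>

  #define DEF_PRIORITY 12

  /* stand-in for fls_long(): 1-based index of the highest set bit */
  static int fls_long(unsigned long x)
  {
          int r = 0;

          while (x) {
                  x >>= 1;
                  r++;
          }
          return r;
  }

  int main(void)
  {
          /* few reclaimable LRU pages: the ARC holds most memory */
          unsigned long reclaimable = 1024;
          unsigned long nr_to_reclaim = 32;
          int priority = fls_long(reclaimable) - 1 -
                         fls_long(nr_to_reclaim - 1);

          /* 11 - 1 - 5 = 5, below DEF_PRIORITY / 2 = 6; a lower
           * priority scans more pages per pass, and the new clamp
           * rules such values out */
          printf("estimated priority = %d\n", priority);
          return 0;
  }

Item 2 is visible directly in the kswapd_shrink_node() hunk below:
progress is measured as the larger of pages scanned and pages
reclaimed since entry, so progress made purely through
mm_account_reclaimed_pages() no longer looks like no progress.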
Reported-by: Alexander Motin <mav@ixsystems.com>
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6403038c776e..6216d79edb7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4930,7 +4930,11 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
/* round down reclaimable and round up sc->nr_to_reclaim */
priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
- sc->priority = clamp(priority, 0, DEF_PRIORITY);
+ /*
+ * The estimation is based on LRU pages only, so cap it to prevent
+ * overshoots of shrinker objects by large margins.
+ */
+ sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
}
static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -6754,6 +6758,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
{
struct zone *zone;
int z;
+ unsigned long nr_reclaimed = sc->nr_reclaimed;
/* Reclaim a number of pages proportional to the number of zones */
sc->nr_to_reclaim = 0;
@@ -6781,7 +6786,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
sc->order = 0;
- return sc->nr_scanned >= sc->nr_to_reclaim;
+ /* account for progress from mm_account_reclaimed_pages() */
+ return max(sc->nr_scanned, sc->nr_reclaimed - nr_reclaimed) >= sc->nr_to_reclaim;
}
/* Page allocator PCP high watermark is lowered if reclaim is active. */
--
2.45.2.993.g49e7a77208-goog