linux-mm.kvack.org archive mirror
* [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
@ 2025-01-11  9:15 Chen Ridong
  2025-01-11 22:12 ` Yu Zhao
  2025-01-13 15:52 ` Johannes Weiner
  0 siblings, 2 replies; 6+ messages in thread
From: Chen Ridong @ 2025-01-11  9:15 UTC (permalink / raw)
  To: akpm, mhocko, hannes, yosryahmed, yuzhao, david, willy,
	ryan.roberts, baohua, 21cnbao, wangkefeng.wang
  Cc: linux-mm, linux-kernel, chenridong, wangweiyang2, xieym_ict

From: Chen Ridong <chenridong@huawei.com>

As commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
while isolated") mentioned:

  The page reclaim isolates a batch of folios from the tail of one of the
  LRU lists and works on those folios one by one.  For a suitable
  swap-backed folio, if the swap device is async, it queues that folio for
  writeback.  After the page reclaim finishes an entire batch, it puts back
  the folios it queued for writeback to the head of the original LRU list.

  In the meantime, the page writeback flushes the queued folios also by
  batches.  Its batching logic is independent from that of the page
  reclaim. For each of the folios it writes back, the page writeback calls
  folio_rotate_reclaimable() which tries to rotate a folio to the tail.

  folio_rotate_reclaimable() only works for a folio after the page reclaim
  has put it back.  If an async swap device is fast enough, the page
  writeback can finish with that folio while the page reclaim is still
  working on the rest of the batch containing it.  In this case, that folio
  will remain at the head and the page reclaim will not retry it before
  reaching there.

Commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
while isolated") fixed this issue only for MGLRU. However, the same
issue also exists in the traditional active/inactive LRU, as reported
at [1].

It can be reproduced with the following steps:

1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
2. Mount memcg v1, create a memcg named test_memcg, and set
   limit_in_bytes=1G and memsw.limit_in_bytes=2G.
3. Create a 1G swap file and allocate 1.05G of anon memory in test_memcg
   (a minimal allocation sketch follows this list).
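
For step 3, a minimal program along the following lines can be used.
This is only an illustrative sketch, not part of the patch: it assumes
steps 1-2 have already been done and that the program is run from a
shell attached to test_memcg (e.g. via cgroup.procs).

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
  	/* ~1.05G of anonymous memory, matching step 3 above. */
  	size_t size = 1075UL * 1024 * 1024;
  	char *buf = malloc(size);

  	if (!buf) {
  		perror("malloc");
  		return 1;
  	}
  	/* Touch every byte so the anon pages are actually populated. */
  	memset(buf, 0x5a, size);
  	/* Hold the memory while the usage counters below are sampled. */
  	getchar();
  	free(buf);
  	return 0;
  }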

It was found that:

  cat memory.limit_in_bytes
  1073741824
  cat memory.memsw.limit_in_bytes
  2147483648
  cat memory.usage_in_bytes
  1073664000
  cat memory.memsw.usage_in_bytes
  1129840640

  free -h
                total        used        free
  Mem:           31Gi       1.2Gi        28Gi
  Swap:         1.0Gi       1.0Gi       2.0Mi

As shown above, test_memcg was charged only about 50M of swap, yet
almost the entire 1G of swap was consumed. In other words, 900M+ of swap
space may be wasted, because other memcgs cannot use it.

This issue should be fixed in the same way as for MGLRU. Therefore,
first extract the common logic into a new helper,
find_folios_written_back(), then reuse it in shrink_inactive_list(), and
finally retry reclaiming those folios that may have missed the rotation
for the traditional LRU.

After the change, with the same test case, only 54M of swap was used:

  cat memory.usage_in_bytes
  1073463296
  cat memory.memsw.usage_in_bytes
  1129828352

  free -h
                total        used        free
  Mem:           31Gi       1.2Gi        28Gi
  Swap:         1.0Gi        54Mi       969Mi

[1] https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
[2] https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---

v6->v7:
 - fix conflicts on top of mm-unstable.
 - update the commit message (quote from Yu's commit message and add the
   improvement observed after the change).
 - rename 'is_retrying' back to 'skip_retry' to keep the original
   semantics.

v6: https://lore.kernel.org/linux-kernel/20241223082004.3759152-1-chenridong@huaweicloud.com/

 mm/vmscan.c | 114 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 76 insertions(+), 38 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01dce6f26..6861b6937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -183,6 +183,9 @@ struct scan_control {
 	struct reclaim_state reclaim_state;
 };
 
+static inline void find_folios_written_back(struct list_head *list,
+		struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry);
+
 #ifdef ARCH_HAS_PREFETCHW
 #define prefetchw_prev_lru_folio(_folio, _base, _field)			\
 	do {								\
@@ -1960,14 +1963,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		enum lru_list lru)
 {
 	LIST_HEAD(folio_list);
+	LIST_HEAD(clean_list);
 	unsigned long nr_scanned;
-	unsigned int nr_reclaimed = 0;
+	unsigned int nr_reclaimed, total_reclaimed = 0;
+	unsigned int nr_pageout = 0;
+	unsigned int nr_unqueued_dirty = 0;
 	unsigned long nr_taken;
 	struct reclaim_stat stat;
 	bool file = is_file_lru(lru);
 	enum vm_event_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	bool stalled = false;
+	bool skip_retry = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
 		if (stalled)
@@ -2001,22 +2008,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	if (nr_taken == 0)
 		return 0;
 
+retry:
 	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
 
+	sc->nr.dirty += stat.nr_dirty;
+	sc->nr.congested += stat.nr_congested;
+	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+	sc->nr.writeback += stat.nr_writeback;
+	sc->nr.immediate += stat.nr_immediate;
+	total_reclaimed += nr_reclaimed;
+	nr_pageout += stat.nr_pageout;
+	nr_unqueued_dirty += stat.nr_unqueued_dirty;
+
+	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
+
+	find_folios_written_back(&folio_list, &clean_list, lruvec, 0, skip_retry);
+
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
 
 	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
 					stat.nr_demoted);
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
+
+	if (!list_empty(&clean_list)) {
+		list_splice_init(&clean_list, &folio_list);
+		skip_retry = true;
+		spin_unlock_irq(&lruvec->lru_lock);
+		goto retry;
+	}
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&lruvec->lru_lock);
+	sc->nr.taken += nr_taken;
+	if (file)
+		sc->nr.file_taken += nr_taken;
 
-	lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
+	lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
 
 	/*
 	 * If dirty folios are scanned that are not queued for IO, it
@@ -2029,7 +2061,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	 * the flushers simply cannot keep up with the allocation
 	 * rate. Nudge the flusher threads in case they are asleep.
 	 */
-	if (stat.nr_unqueued_dirty == nr_taken) {
+	if (nr_unqueued_dirty == nr_taken) {
 		wakeup_flusher_threads(WB_REASON_VMSCAN);
 		/*
 		 * For cgroupv1 dirty throttling is achieved by waking up
@@ -2044,18 +2076,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 	}
 
-	sc->nr.dirty += stat.nr_dirty;
-	sc->nr.congested += stat.nr_congested;
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
-	sc->nr.writeback += stat.nr_writeback;
-	sc->nr.immediate += stat.nr_immediate;
-	sc->nr.taken += nr_taken;
-	if (file)
-		sc->nr.file_taken += nr_taken;
-
-	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
-	return nr_reclaimed;
+	return total_reclaimed;
 }
 
 /*
@@ -4637,8 +4658,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	int reclaimed;
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
-	struct folio *folio;
-	struct folio *next;
 	enum vm_event_item item;
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
@@ -4668,26 +4687,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 			scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
-	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
-		DEFINE_MIN_SEQ(lruvec);
-
-		if (!folio_evictable(folio)) {
-			list_del(&folio->lru);
-			folio_putback_lru(folio);
-			continue;
-		}
-
-		/* retry folios that may have missed folio_rotate_reclaimable() */
-		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
-		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
-			list_move(&folio->lru, &clean);
-			continue;
-		}
-
-		/* don't add rejected folios to the oldest generation */
-		if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
-			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
-	}
+	find_folios_written_back(&list, &clean, lruvec, type, skip_retry);
 
 	spin_lock_irq(&lruvec->lru_lock);
 
@@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 
 #endif /* CONFIG_LRU_GEN */
 
+/**
+ * find_folios_written_back - Find and move the written back folios to a new list.
+ * @list: the list of isolated folios
+ * @clean: the written back folios list
+ * @lruvec: the lruvec
+ * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
+ * @skip_retry: whether to skip the retry
+ */
+static inline void find_folios_written_back(struct list_head *list,
+		struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
+{
+	struct folio *folio;
+	struct folio *next;
+
+	list_for_each_entry_safe_reverse(folio, next, list, lru) {
+#ifdef CONFIG_LRU_GEN
+		DEFINE_MIN_SEQ(lruvec);
+#endif
+		if (!folio_evictable(folio)) {
+			list_del(&folio->lru);
+			folio_putback_lru(folio);
+			continue;
+		}
+
+		/* retry folios that may have missed folio_rotate_reclaimable() */
+		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
+		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
+			list_move(&folio->lru, clean);
+			continue;
+		}
+#ifdef CONFIG_LRU_GEN
+		/* don't add rejected folios to the oldest generation */
+		if (lru_gen_enabled() && lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
+			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
+#endif
+	}
+}
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
-- 
2.34.1




* Re: [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
  2025-01-11  9:15 [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU Chen Ridong
@ 2025-01-11 22:12 ` Yu Zhao
  2025-01-13  8:35   ` chenridong
  2025-01-13 15:52 ` Johannes Weiner
  1 sibling, 1 reply; 6+ messages in thread
From: Yu Zhao @ 2025-01-11 22:12 UTC (permalink / raw)
  To: Chen Ridong, Wei Xu
  Cc: akpm, mhocko, hannes, yosryahmed, david, willy, ryan.roberts,
	baohua, 21cnbao, wangkefeng.wang, linux-mm, linux-kernel,
	chenridong, wangweiyang2, xieym_ict

On Sat, Jan 11, 2025 at 2:25 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
> From: Chen Ridong <chenridong@huawei.com>
>
> As commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
> while isolated") mentioned:
>
>   The page reclaim isolates a batch of folios from the tail of one of the
>   LRU lists and works on those folios one by one.  For a suitable
>   swap-backed folio, if the swap device is async, it queues that folio for
>   writeback.  After the page reclaim finishes an entire batch, it puts back
>   the folios it queued for writeback to the head of the original LRU list.
>
>   In the meantime, the page writeback flushes the queued folios also by
>   batches.  Its batching logic is independent from that of the page
>   reclaim. For each of the folios it writes back, the page writeback calls
>   folio_rotate_reclaimable() which tries to rotate a folio to the tail.
>
>   folio_rotate_reclaimable() only works for a folio after the page reclaim
>   has put it back.  If an async swap device is fast enough, the page
>   writeback can finish with that folio while the page reclaim is still
>   working on the rest of the batch containing it.  In this case, that folio
>   will remain at the head and the page reclaim will not retry it before
>   reaching there".
>
> The commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
> while isolated") only fixed the issue for mglru. However, this issue
> also exists in the traditional active/inactive LRU and was found at [1].

The active/inactive LRU needs more careful thought due to its
complexity. Details below.

> It can be reproduced with below steps:
>
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
> 2. Mount memcg v1, and create memcg named test_memcg and set
>    limit_in_bytes=1G, memsw.limit_in_bytes=2G.
> 3. Create a 1G swap file, and allocate 1.05G anon memory in test_memcg.
>
> It was found that:
>
>   cat memory.limit_in_bytes
>   1073741824
>   cat memory.memsw.limit_in_bytes
>   2147483648
>   cat memory.usage_in_bytes
>   1073664000
>   cat memory.memsw.usage_in_bytes
>   1129840640
>
>   free -h
>                 total        used        free
>   Mem:           31Gi       1.2Gi        28Gi
>   Swap:         1.0Gi       1.0Gi       2.0Mi
>
> As shown above, the test_memcg used about 50M swap, but almost 1G swap
> memory was used, which means that 900M+ may be wasted because other memcgs
> can not use these swap memory.
>
> This issue should be fixed in the same way as mglru. Therefore, the common
> logic was extracted to the 'find_folios_written_back' function firstly,
> which is then reused in the 'shrink_inactive_list' function. Finally,
> retry reclaiming those folios that may have missed the rotation for
> traditional LRU.
>
> After change, the same test case. only 54M swap was used.
>
>   cat memory.usage_in_bytes
>   1073463296
>   cat memory.memsw.usage_in_bytes
>   1129828352
>
>   free -h
>                 total        used        free
>   Mem:           31Gi       1.2Gi        28Gi
>   Swap:         1.0Gi        54Mi       969Mi
>
> [1] https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
> [2] https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>
> v6->v7:
>  - fix conflict based on mm-unstable.
>  - update the commit message(quote from YU's commit message, and add
>    improvements after change.)
>  - restore 'is_retrying' to 'skip_retry' to keep original semantics.
>
> v6: https://lore.kernel.org/linux-kernel/20241223082004.3759152-1-chenridong@huaweicloud.com/
>
>  mm/vmscan.c | 114 ++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 76 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 01dce6f26..6861b6937 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -183,6 +183,9 @@ struct scan_control {
>         struct reclaim_state reclaim_state;
>  };
>
> +static inline void find_folios_written_back(struct list_head *list,
> +               struct list_head *clean, struct lruvec *lruvec, int type, bool is_retrying);
> +
>  #ifdef ARCH_HAS_PREFETCHW
>  #define prefetchw_prev_lru_folio(_folio, _base, _field)                        \
>         do {                                                            \
> @@ -1960,14 +1963,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>                 enum lru_list lru)
>  {
>         LIST_HEAD(folio_list);
> +       LIST_HEAD(clean_list);
>         unsigned long nr_scanned;
> -       unsigned int nr_reclaimed = 0;
> +       unsigned int nr_reclaimed, total_reclaimed = 0;
> +       unsigned int nr_pageout = 0;
> +       unsigned int nr_unqueued_dirty = 0;
>         unsigned long nr_taken;
>         struct reclaim_stat stat;
>         bool file = is_file_lru(lru);
>         enum vm_event_item item;
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         bool stalled = false;
> +       bool skip_retry = false;
>
>         while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 if (stalled)
> @@ -2001,22 +2008,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>         if (nr_taken == 0)
>                 return 0;
>
> +retry:
>         nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>
> +       sc->nr.dirty += stat.nr_dirty;
> +       sc->nr.congested += stat.nr_congested;
> +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> +       sc->nr.writeback += stat.nr_writeback;

I think this change breaks the checks that consume the stats above,
e.g., the wakeup_flusher_threads() condition, because the same
dirty/writeback folio can be counted twice. The reason is that
folio_test_dirty()/folio_test_writeback() can't account for
dirty/writeback buffer heads, which can only be done by
folio_check_dirty_writeback().

For MGLRU, it has been broken since day 1 and commit 1bc542c6a0d1
("mm/vmscan: wake up flushers conditionally to avoid cgroup OOM")
doesn't account for this either. I'll get around to that.
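
(For reference, a simplified sketch of the distinction above, loosely
modeled on the existing folio_check_dirty_writeback() logic -- not new
code being proposed: the folio flags are only a starting point, and a
filesystem that keeps state in buffer heads refines them through its
->is_dirty_writeback address_space op.)

  static void check_dirty_writeback_sketch(struct folio *folio,
  					 bool *dirty, bool *writeback)
  {
  	struct address_space *mapping;

  	/* Start from the folio flags... */
  	*dirty = folio_test_dirty(folio);
  	*writeback = folio_test_writeback(folio);

  	/* ...then let the filesystem refine them for buffer-head state. */
  	if (!folio_test_private(folio))
  		return;

  	mapping = folio_mapping(folio);
  	if (mapping && mapping->a_ops->is_dirty_writeback)
  		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
  }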

> +       sc->nr.immediate += stat.nr_immediate;
> +       total_reclaimed += nr_reclaimed;
> +       nr_pageout += stat.nr_pageout;
> +       nr_unqueued_dirty += stat.nr_unqueued_dirty;
> +
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> +
> +       find_folios_written_back(&folio_list, &clean_list, lruvec, 0, skip_retry);
> +
>         spin_lock_irq(&lruvec->lru_lock);
>         move_folios_to_lru(lruvec, &folio_list);
>
>         __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>                                         stat.nr_demoted);
> -       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>         item = PGSTEAL_KSWAPD + reclaimer_offset();
>         if (!cgroup_reclaim(sc))
>                 __count_vm_events(item, nr_reclaimed);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> +
> +       if (!list_empty(&clean_list)) {
> +               list_splice_init(&clean_list, &folio_list);
> +               skip_retry = true;
> +               spin_unlock_irq(&lruvec->lru_lock);
> +               goto retry;
> +       }
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>         spin_unlock_irq(&lruvec->lru_lock);
> +       sc->nr.taken += nr_taken;
> +       if (file)
> +               sc->nr.file_taken += nr_taken;
>
> -       lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
> +       lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
>
>         /*
>          * If dirty folios are scanned that are not queued for IO, it
> @@ -2029,7 +2061,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>          * the flushers simply cannot keep up with the allocation
>          * rate. Nudge the flusher threads in case they are asleep.
>          */
> -       if (stat.nr_unqueued_dirty == nr_taken) {
> +       if (nr_unqueued_dirty == nr_taken) {
>                 wakeup_flusher_threads(WB_REASON_VMSCAN);
>                 /*
>                  * For cgroupv1 dirty throttling is achieved by waking up
> @@ -2044,18 +2076,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>                         reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>         }
>
> -       sc->nr.dirty += stat.nr_dirty;
> -       sc->nr.congested += stat.nr_congested;
> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> -       sc->nr.writeback += stat.nr_writeback;
> -       sc->nr.immediate += stat.nr_immediate;
> -       sc->nr.taken += nr_taken;
> -       if (file)
> -               sc->nr.file_taken += nr_taken;
> -
> -       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> -                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> -       return nr_reclaimed;
> +       return total_reclaimed;
>  }
>
>  /*
> @@ -4637,8 +4658,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>         int reclaimed;
>         LIST_HEAD(list);
>         LIST_HEAD(clean);
> -       struct folio *folio;
> -       struct folio *next;
>         enum vm_event_item item;
>         struct reclaim_stat stat;
>         struct lru_gen_mm_walk *walk;
> @@ -4668,26 +4687,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>                         scanned, reclaimed, &stat, sc->priority,
>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> -       list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> -               DEFINE_MIN_SEQ(lruvec);
> -
> -               if (!folio_evictable(folio)) {
> -                       list_del(&folio->lru);
> -                       folio_putback_lru(folio);
> -                       continue;
> -               }
> -
> -               /* retry folios that may have missed folio_rotate_reclaimable() */
> -               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> -                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> -                       list_move(&folio->lru, &clean);
> -                       continue;
> -               }
> -
> -               /* don't add rejected folios to the oldest generation */
> -               if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
> -                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> -       }
> +       find_folios_written_back(&list, &clean, lruvec, type, skip_retry);
>
>         spin_lock_irq(&lruvec->lru_lock);
>
> @@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>
>  #endif /* CONFIG_LRU_GEN */
>
> +/**
> + * find_folios_written_back - Find and move the written back folios to a new list.
> + * @list: filios list
> + * @clean: the written back folios list
> + * @lruvec: the lruvec
> + * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
> + * @skip_retry: whether skip retry.
> + */
> +static inline void find_folios_written_back(struct list_head *list,
> +               struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
> +{
> +       struct folio *folio;
> +       struct folio *next;
> +
> +       list_for_each_entry_safe_reverse(folio, next, list, lru) {
> +#ifdef CONFIG_LRU_GEN
> +               DEFINE_MIN_SEQ(lruvec);
> +#endif
> +               if (!folio_evictable(folio)) {
> +                       list_del(&folio->lru);
> +                       folio_putback_lru(folio);
> +                       continue;
> +               }
> +
> +               /* retry folios that may have missed folio_rotate_reclaimable() */
> +               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> +                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {

Have you verified that this condition also holds for the
active/inactive LRU, or did you just assume it? IOW, how do we know the
active/inactive LRU doesn't think this folio should be kept (and put
back to the head of the inactive LRU list)?



* Re: [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
  2025-01-11 22:12 ` Yu Zhao
@ 2025-01-13  8:35   ` chenridong
  0 siblings, 0 replies; 6+ messages in thread
From: chenridong @ 2025-01-13  8:35 UTC (permalink / raw)
  To: Yu Zhao, Chen Ridong, Wei Xu
  Cc: akpm, mhocko, hannes, yosryahmed, david, willy, ryan.roberts,
	baohua, 21cnbao, wangkefeng.wang, linux-mm, linux-kernel,
	wangweiyang2, xieym_ict



On 2025/1/12 6:12, Yu Zhao wrote:
> On Sat, Jan 11, 2025 at 2:25 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> As commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
>> while isolated") mentioned:
>>
>>   The page reclaim isolates a batch of folios from the tail of one of the
>>   LRU lists and works on those folios one by one.  For a suitable
>>   swap-backed folio, if the swap device is async, it queues that folio for
>>   writeback.  After the page reclaim finishes an entire batch, it puts back
>>   the folios it queued for writeback to the head of the original LRU list.
>>
>>   In the meantime, the page writeback flushes the queued folios also by
>>   batches.  Its batching logic is independent from that of the page
>>   reclaim. For each of the folios it writes back, the page writeback calls
>>   folio_rotate_reclaimable() which tries to rotate a folio to the tail.
>>
>>   folio_rotate_reclaimable() only works for a folio after the page reclaim
>>   has put it back.  If an async swap device is fast enough, the page
>>   writeback can finish with that folio while the page reclaim is still
>>   working on the rest of the batch containing it.  In this case, that folio
>>   will remain at the head and the page reclaim will not retry it before
>>   reaching there".
>>
>> The commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
>> while isolated") only fixed the issue for mglru. However, this issue
>> also exists in the traditional active/inactive LRU and was found at [1].
> 
> The active/inactive LRU needs more careful thoughts due to its
> complexity. Details below.
> 
>> It can be reproduced with below steps:
>>
>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>    limit_in_bytes=1G, memsw.limit_in_bytes=2G.
>> 3. Create a 1G swap file, and allocate 1.05G anon memory in test_memcg.
>>
>> It was found that:
>>
>>   cat memory.limit_in_bytes
>>   1073741824
>>   cat memory.memsw.limit_in_bytes
>>   2147483648
>>   cat memory.usage_in_bytes
>>   1073664000
>>   cat memory.memsw.usage_in_bytes
>>   1129840640
>>
>>   free -h
>>                 total        used        free
>>   Mem:           31Gi       1.2Gi        28Gi
>>   Swap:         1.0Gi       1.0Gi       2.0Mi
>>
>> As shown above, the test_memcg used about 50M swap, but almost 1G swap
>> memory was used, which means that 900M+ may be wasted because other memcgs
>> can not use these swap memory.
>>
>> This issue should be fixed in the same way as mglru. Therefore, the common
>> logic was extracted to the 'find_folios_written_back' function firstly,
>> which is then reused in the 'shrink_inactive_list' function. Finally,
>> retry reclaiming those folios that may have missed the rotation for
>> traditional LRU.
>>
>> After change, the same test case. only 54M swap was used.
>>
>>   cat memory.usage_in_bytes
>>   1073463296
>>   cat memory.memsw.usage_in_bytes
>>   1129828352
>>
>>   free -h
>>                 total        used        free
>>   Mem:           31Gi       1.2Gi        28Gi
>>   Swap:         1.0Gi        54Mi       969Mi
>>
>> [1] https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
>> [2] https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> ---
>>
>> v6->v7:
>>  - fix conflict based on mm-unstable.
>>  - update the commit message(quote from YU's commit message, and add
>>    improvements after change.)
>>  - restore 'is_retrying' to 'skip_retry' to keep original semantics.
>>
>> v6: https://lore.kernel.org/linux-kernel/20241223082004.3759152-1-chenridong@huaweicloud.com/
>>
>>  mm/vmscan.c | 114 ++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 76 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 01dce6f26..6861b6937 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -183,6 +183,9 @@ struct scan_control {
>>         struct reclaim_state reclaim_state;
>>  };
>>
>> +static inline void find_folios_written_back(struct list_head *list,
>> +               struct list_head *clean, struct lruvec *lruvec, int type, bool is_retrying);
>> +
>>  #ifdef ARCH_HAS_PREFETCHW
>>  #define prefetchw_prev_lru_folio(_folio, _base, _field)                        \
>>         do {                                                            \
>> @@ -1960,14 +1963,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>                 enum lru_list lru)
>>  {
>>         LIST_HEAD(folio_list);
>> +       LIST_HEAD(clean_list);
>>         unsigned long nr_scanned;
>> -       unsigned int nr_reclaimed = 0;
>> +       unsigned int nr_reclaimed, total_reclaimed = 0;
>> +       unsigned int nr_pageout = 0;
>> +       unsigned int nr_unqueued_dirty = 0;
>>         unsigned long nr_taken;
>>         struct reclaim_stat stat;
>>         bool file = is_file_lru(lru);
>>         enum vm_event_item item;
>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>         bool stalled = false;
>> +       bool skip_retry = false;
>>
>>         while (unlikely(too_many_isolated(pgdat, file, sc))) {
>>                 if (stalled)
>> @@ -2001,22 +2008,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>         if (nr_taken == 0)
>>                 return 0;
>>
>> +retry:
>>         nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>>
>> +       sc->nr.dirty += stat.nr_dirty;
>> +       sc->nr.congested += stat.nr_congested;
>> +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>> +       sc->nr.writeback += stat.nr_writeback;
> 
> I think this change breaks the tests on the stats above, e.g.,
> wakeup_flusher_threads(), because the same dirty/writeback folio can
> be counted twice. The reason for that is that
> folio_test_dirty/writeback() can't account for dirty/writeback buffer
> heads, which can only be done by folio_check_dirty_writeback().
> 
> For MGLRU, it has been broken since day 1 and commit 1bc542c6a0d1
> ("mm/vmscan: wake up flushers conditionally to avoid cgroup OOM")
> doesn't account for this either. I'll get around to that.

Hi Yu, thank you for your review.
Maybe nr_reclaimed is the only value we need to accumulate? We only want
to retry folios that may have missed folio_rotate_reclaimable(), and
those folios should then be reclaimed and freed, so nr_reclaimed has to
be accumulated. For the other fields in stat, we could just keep the
values obtained from the first shrink_folio_list() pass. But I'm not
sure if I'm missing something.

> 
>> +       sc->nr.immediate += stat.nr_immediate;
>> +       total_reclaimed += nr_reclaimed;
>> +       nr_pageout += stat.nr_pageout;
>> +       nr_unqueued_dirty += stat.nr_unqueued_dirty;
>> +
>> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>> +                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>> +
>> +       find_folios_written_back(&folio_list, &clean_list, lruvec, 0, skip_retry);
>> +
>>         spin_lock_irq(&lruvec->lru_lock);
>>         move_folios_to_lru(lruvec, &folio_list);
>>
>>         __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>>                                         stat.nr_demoted);
>> -       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>>         item = PGSTEAL_KSWAPD + reclaimer_offset();
>>         if (!cgroup_reclaim(sc))
>>                 __count_vm_events(item, nr_reclaimed);
>>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>>         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
>> +
>> +       if (!list_empty(&clean_list)) {
>> +               list_splice_init(&clean_list, &folio_list);
>> +               skip_retry = true;
>> +               spin_unlock_irq(&lruvec->lru_lock);
>> +               goto retry;
>> +       }
>> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>>         spin_unlock_irq(&lruvec->lru_lock);
>> +       sc->nr.taken += nr_taken;
>> +       if (file)
>> +               sc->nr.file_taken += nr_taken;
>>
>> -       lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
>> +       lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
>>
>>         /*
>>          * If dirty folios are scanned that are not queued for IO, it
>> @@ -2029,7 +2061,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>          * the flushers simply cannot keep up with the allocation
>>          * rate. Nudge the flusher threads in case they are asleep.
>>          */
>> -       if (stat.nr_unqueued_dirty == nr_taken) {
>> +       if (nr_unqueued_dirty == nr_taken) {
>>                 wakeup_flusher_threads(WB_REASON_VMSCAN);
>>                 /*
>>                  * For cgroupv1 dirty throttling is achieved by waking up
>> @@ -2044,18 +2076,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>                         reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>>         }
>>
>> -       sc->nr.dirty += stat.nr_dirty;
>> -       sc->nr.congested += stat.nr_congested;
>> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>> -       sc->nr.writeback += stat.nr_writeback;
>> -       sc->nr.immediate += stat.nr_immediate;
>> -       sc->nr.taken += nr_taken;
>> -       if (file)
>> -               sc->nr.file_taken += nr_taken;
>> -
>> -       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>> -                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>> -       return nr_reclaimed;
>> +       return total_reclaimed;
>>  }
>>
>>  /*
>> @@ -4637,8 +4658,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>         int reclaimed;
>>         LIST_HEAD(list);
>>         LIST_HEAD(clean);
>> -       struct folio *folio;
>> -       struct folio *next;
>>         enum vm_event_item item;
>>         struct reclaim_stat stat;
>>         struct lru_gen_mm_walk *walk;
>> @@ -4668,26 +4687,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>                         scanned, reclaimed, &stat, sc->priority,
>>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>>
>> -       list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>> -               DEFINE_MIN_SEQ(lruvec);
>> -
>> -               if (!folio_evictable(folio)) {
>> -                       list_del(&folio->lru);
>> -                       folio_putback_lru(folio);
>> -                       continue;
>> -               }
>> -
>> -               /* retry folios that may have missed folio_rotate_reclaimable() */
>> -               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
>> -                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
>> -                       list_move(&folio->lru, &clean);
>> -                       continue;
>> -               }
>> -
>> -               /* don't add rejected folios to the oldest generation */
>> -               if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
>> -                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
>> -       }
>> +       find_folios_written_back(&list, &clean, lruvec, type, skip_retry);
>>
>>         spin_lock_irq(&lruvec->lru_lock);
>>
>> @@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>>
>>  #endif /* CONFIG_LRU_GEN */
>>
>> +/**
>> + * find_folios_written_back - Find and move the written back folios to a new list.
>> + * @list: filios list
>> + * @clean: the written back folios list
>> + * @lruvec: the lruvec
>> + * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
>> + * @skip_retry: whether skip retry.
>> + */
>> +static inline void find_folios_written_back(struct list_head *list,
>> +               struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
>> +{
>> +       struct folio *folio;
>> +       struct folio *next;
>> +
>> +       list_for_each_entry_safe_reverse(folio, next, list, lru) {
>> +#ifdef CONFIG_LRU_GEN
>> +               DEFINE_MIN_SEQ(lruvec);
>> +#endif
>> +               if (!folio_evictable(folio)) {
>> +                       list_del(&folio->lru);
>> +                       folio_putback_lru(folio);
>> +                       continue;
>> +               }
>> +
>> +               /* retry folios that may have missed folio_rotate_reclaimable() */
>> +               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
>> +                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> 
> Have you verified that this condition also holds for the
> active/inactive LRU or did you just assume it? IOW, how do we know the
> active/inactive LRU doesn't think this folio should be kept (and put
> back to the head of the inactive LRU list).
> 

As the commit message shows, I tested this case and it worked. I added
logs, and they did identify folios that had missed
folio_rotate_reclaimable(). I think identifying folios that may have
missed folio_rotate_reclaimable() works the same way for both MGLRU and
the active/inactive LRU. Or did I miss something again?

Thank you again.

Best regards,
Ridong



* Re: [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
  2025-01-11  9:15 [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU Chen Ridong
  2025-01-11 22:12 ` Yu Zhao
@ 2025-01-13 15:52 ` Johannes Weiner
  2025-01-15  3:59   ` Yu Zhao
  1 sibling, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2025-01-13 15:52 UTC (permalink / raw)
  To: Chen Ridong
  Cc: akpm, mhocko, yosryahmed, yuzhao, david, willy, ryan.roberts,
	baohua, 21cnbao, wangkefeng.wang, linux-mm, linux-kernel,
	chenridong, wangweiyang2, xieym_ict

On Sat, Jan 11, 2025 at 09:15:04AM +0000, Chen Ridong wrote:
> @@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>  
>  #endif /* CONFIG_LRU_GEN */
>  
> +/**
> + * find_folios_written_back - Find and move the written back folios to a new list.
> + * @list: filios list
> + * @clean: the written back folios list
> + * @lruvec: the lruvec
> + * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
> + * @skip_retry: whether skip retry.
> + */
> +static inline void find_folios_written_back(struct list_head *list,
> +		struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
> +{
> +	struct folio *folio;
> +	struct folio *next;
> +
> +	list_for_each_entry_safe_reverse(folio, next, list, lru) {
> +#ifdef CONFIG_LRU_GEN
> +		DEFINE_MIN_SEQ(lruvec);
> +#endif
> +		if (!folio_evictable(folio)) {
> +			list_del(&folio->lru);
> +			folio_putback_lru(folio);
> +			continue;
> +		}
> +
> +		/* retry folios that may have missed folio_rotate_reclaimable() */
> +		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> +		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> +			list_move(&folio->lru, clean);
> +			continue;
> +		}
> +#ifdef CONFIG_LRU_GEN
> +		/* don't add rejected folios to the oldest generation */
> +		if (lru_gen_enabled() && lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
> +			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> +#endif
> +	}

Can't this be solved much more easily by acting on the flag in the
generic LRU add/putback path, instead of walking the list again?

Especially with Kirill's "[PATCH 0/8] mm: Remove PG_reclaim", which
removes the PG_readahead ambiguity.
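
(One possible reading of this suggestion, purely as an illustrative
sketch and not a tested implementation -- the helper name and placement
are assumptions: at putback time, a folio whose writeback already
completed while it was isolated could be routed straight to the tail,
based on the reclaim flag that folio_rotate_reclaimable() acts on.)

  /* Illustrative sketch only: choose head vs. tail at putback time. */
  static void lru_putback_one(struct lruvec *lruvec, struct folio *folio)
  {
  	if (!folio_test_dirty(folio) && !folio_test_writeback(folio) &&
  	    folio_test_clear_reclaim(folio))
  		/* Writeback finished while isolated; reclaim it next. */
  		lruvec_add_folio_tail(lruvec, folio);
  	else
  		lruvec_add_folio(lruvec, folio);
  }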



* Re: [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
  2025-01-13 15:52 ` Johannes Weiner
@ 2025-01-15  3:59   ` Yu Zhao
  2025-01-17 16:08     ` Johannes Weiner
  0 siblings, 1 reply; 6+ messages in thread
From: Yu Zhao @ 2025-01-15  3:59 UTC (permalink / raw)
  To: Johannes Weiner, Kirill A . Shutemov
  Cc: Chen Ridong, akpm, mhocko, yosryahmed, david, willy,
	ryan.roberts, baohua, 21cnbao, wangkefeng.wang, linux-mm,
	linux-kernel, chenridong, wangweiyang2, xieym_ict

On Mon, Jan 13, 2025 at 8:52 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sat, Jan 11, 2025 at 09:15:04AM +0000, Chen Ridong wrote:
> > @@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
> >
> >  #endif /* CONFIG_LRU_GEN */
> >
> > +/**
> > + * find_folios_written_back - Find and move the written back folios to a new list.
> > + * @list: filios list
> > + * @clean: the written back folios list
> > + * @lruvec: the lruvec
> > + * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
> > + * @skip_retry: whether skip retry.
> > + */
> > +static inline void find_folios_written_back(struct list_head *list,
> > +             struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
> > +{
> > +     struct folio *folio;
> > +     struct folio *next;
> > +
> > +     list_for_each_entry_safe_reverse(folio, next, list, lru) {
> > +#ifdef CONFIG_LRU_GEN
> > +             DEFINE_MIN_SEQ(lruvec);
> > +#endif
> > +             if (!folio_evictable(folio)) {
> > +                     list_del(&folio->lru);
> > +                     folio_putback_lru(folio);
> > +                     continue;
> > +             }
> > +
> > +             /* retry folios that may have missed folio_rotate_reclaimable() */
> > +             if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> > +                 !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> > +                     list_move(&folio->lru, clean);
> > +                     continue;
> > +             }
> > +#ifdef CONFIG_LRU_GEN
> > +             /* don't add rejected folios to the oldest generation */
> > +             if (lru_gen_enabled() && lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
> > +                     set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> > +#endif
> > +     }
>
> Can't this solved much more easily by acting on the flag in the
> generic LRU add/putback path? Instead of walking the list again.
>
> Especially with Kirill's "[PATCH 0/8] mm: Remove PG_reclaim" that
> removes the PG_readahead ambiguity.

I don't follow -- my understanding is that with Kirill's series, there
is no need to do anything for the generic path. (I'll remove the retry
in MGLRU which Kirill left behind.)

This approach actually came up during the discussions while I was
looking at the problem. My concern back then was that shifting the
work from the reclaim path to the writeback path can reduce the
overall writeback throughput. IIRC, there were regressions with how I
implemented it. Let me try finding my notes and see if those
regressions still exist with Jen and Kirill's implementation.



* Re: [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
  2025-01-15  3:59   ` Yu Zhao
@ 2025-01-17 16:08     ` Johannes Weiner
  0 siblings, 0 replies; 6+ messages in thread
From: Johannes Weiner @ 2025-01-17 16:08 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Kirill A . Shutemov, Chen Ridong, akpm, mhocko, yosryahmed,
	david, willy, ryan.roberts, baohua, 21cnbao, wangkefeng.wang,
	linux-mm, linux-kernel, chenridong, wangweiyang2, xieym_ict

On Tue, Jan 14, 2025 at 08:59:04PM -0700, Yu Zhao wrote:
> On Mon, Jan 13, 2025 at 8:52 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Sat, Jan 11, 2025 at 09:15:04AM +0000, Chen Ridong wrote:
> > > @@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
> > >
> > >  #endif /* CONFIG_LRU_GEN */
> > >
> > > +/**
> > > + * find_folios_written_back - Find and move the written back folios to a new list.
> > > + * @list: filios list
> > > + * @clean: the written back folios list
> > > + * @lruvec: the lruvec
> > > + * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
> > > + * @skip_retry: whether skip retry.
> > > + */
> > > +static inline void find_folios_written_back(struct list_head *list,
> > > +             struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
> > > +{
> > > +     struct folio *folio;
> > > +     struct folio *next;
> > > +
> > > +     list_for_each_entry_safe_reverse(folio, next, list, lru) {
> > > +#ifdef CONFIG_LRU_GEN
> > > +             DEFINE_MIN_SEQ(lruvec);
> > > +#endif
> > > +             if (!folio_evictable(folio)) {
> > > +                     list_del(&folio->lru);
> > > +                     folio_putback_lru(folio);
> > > +                     continue;
> > > +             }
> > > +
> > > +             /* retry folios that may have missed folio_rotate_reclaimable() */
> > > +             if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> > > +                 !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> > > +                     list_move(&folio->lru, clean);
> > > +                     continue;
> > > +             }
> > > +#ifdef CONFIG_LRU_GEN
> > > +             /* don't add rejected folios to the oldest generation */
> > > +             if (lru_gen_enabled() && lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
> > > +                     set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> > > +#endif
> > > +     }
> >
> > Can't this solved much more easily by acting on the flag in the
> > generic LRU add/putback path? Instead of walking the list again.
> >
> > Especially with Kirill's "[PATCH 0/8] mm: Remove PG_reclaim" that
> > removes the PG_readahead ambiguity.
> 
> I don't follow -- my understanding is that with Kirill's series, there
> is no need to do anything for the generic path. (I'll remove the retry
> in MGLRU which Kirill left behind.)

Dropbehind trylocks the folio, so there is still a chance of misses
when the IO submission is from reclaim (which also relocks the folio
after IO submission).

Granted, that race window is much smaller compared to the window
between rotation and batched LRU putback. Probably best to retest on
top of all the pending patches.


