linux-mm.kvack.org archive mirror
* [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations
@ 2024-12-31  4:35 Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 1/7] mm/mglru: clean up workingset Yu Zhao
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao

This series improves performance for some previously reported test
cases. Most of the code changes gathered here have been floating on the
mailing list [1][2]. They are now properly organized and have gone
through various benchmarks on client and server devices, including
Android, FIO, memcached, multiple VMs and MongoDB.

In addition to the syzbot regressions fixed in v2 [3] and v3 [4], this
version fixes two more regressions: one reported by Oliver Sang [5]
and the other by Barry Song.

[1] https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
[2] https://lore.kernel.org/CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com/
[3] https://lore.kernel.org/67294349.050a0220.701a.0010.GAE@google.com/
[4] https://lore.kernel.org/67549eca.050a0220.2477f.001b.GAE@google.com/
[5] https://lore.kernel.org/202412231601.f1eb8f84-lkp@intel.com/

Yu Zhao (7):
  mm/mglru: clean up workingset
  mm/mglru: optimize deactivation
  mm/mglru: rework aging feedback
  mm/mglru: rework type selection
  mm/mglru: rework refault detection
  mm/mglru: rework workingset protection
  mm/mglru: fix PTE-mapped large folios

 include/linux/mm_inline.h |  88 ++++---
 include/linux/mmzone.h    |  99 +++++---
 mm/swap.c                 |  70 ++++--
 mm/vmscan.c               | 515 +++++++++++++++++++-------------------
 mm/workingset.c           |  67 +++--
 5 files changed, 445 insertions(+), 394 deletions(-)

-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH mm-unstable v4 1/7] mm/mglru: clean up workingset
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 2/7] mm/mglru: optimize deactivation Yu Zhao
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao, Kalesh Singh

Move VM_BUG_ON_FOLIO() to cover both the default and MGLRU paths. Also
use a pair of rcu_read_lock() and rcu_read_unlock() within each path,
to improve readability.

This change should not have any side effects.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 mm/workingset.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index a4705e196545..ad181d1b8cf1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -428,17 +428,17 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 	struct pglist_data *pgdat;
 	unsigned long eviction;
 
-	rcu_read_lock();
-
 	if (lru_gen_enabled()) {
-		bool recent = lru_gen_test_recent(shadow, file,
-				&eviction_lruvec, &eviction, workingset);
+		bool recent;
 
+		rcu_read_lock();
+		recent = lru_gen_test_recent(shadow, file, &eviction_lruvec,
+					     &eviction, workingset);
 		rcu_read_unlock();
 		return recent;
 	}
 
-
+	rcu_read_lock();
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 	eviction <<= bucket_order;
 
@@ -459,14 +459,12 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 	 * configurations instead.
 	 */
 	eviction_memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() &&
-	    (!eviction_memcg || !mem_cgroup_tryget(eviction_memcg))) {
-		rcu_read_unlock();
+	if (!mem_cgroup_tryget(eviction_memcg))
+		eviction_memcg = NULL;
+	rcu_read_unlock();
+
+	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
-	}
-
-	rcu_read_unlock();
-
 	/*
 	 * Flush stats (and potentially sleep) outside the RCU read section.
 	 *
@@ -544,6 +542,8 @@ void workingset_refault(struct folio *folio, void *shadow)
 	bool workingset;
 	long nr;
 
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
 	if (lru_gen_enabled()) {
 		lru_gen_refault(folio, shadow);
 		return;
@@ -558,7 +558,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	 * is actually experiencing the refault event. Make sure the folio is
 	 * locked to guarantee folio_memcg() stability throughout.
 	 */
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	nr = folio_nr_pages(folio);
 	memcg = folio_memcg(folio);
 	pgdat = folio_pgdat(folio);
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH mm-unstable v4 2/7] mm/mglru: optimize deactivation
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 1/7] mm/mglru: clean up workingset Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback Yu Zhao
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yu Zhao, Bharata B Rao, Kalesh Singh

Do not shuffle a folio in the deactivation paths if it is already in
the oldest generation. This reduces the LRU lock contention.

Before this patch, the contention is reproducible with FIO, e.g.,

  fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
      -rwmixwrite=30  --norandommap --randrepeat=0 -ioengine=sync \
      -bs=4k -numjobs=400 -runtime=25000 --time_based \
      -group_reporting -name=mglru

  98.96%--_raw_spin_lock_irqsave
          folio_lruvec_lock_irqsave
          |
           --98.78%--folio_batch_move_lru
               |
                --98.63%--deactivate_file_folio
                          mapping_try_invalidate
                          invalidate_mapping_pages
                          invalidate_bdev
                          blkdev_common_ioctl
                          blkdev_ioctl

After this patch, deactivate_file_folio() bails out early without
taking the LRU lock.
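
For illustration only, a minimal userspace sketch of the check that
enables this bail-out (hypothetical, not the kernel code; the name
can_skip_lru_lock() is made up here, and the folding mirrors
lru_gen_from_seq()):

  #include <stdbool.h>
  #include <stdio.h>

  #define MAX_NR_GENS     4UL

  /* fold a sequence number into a generation index, as in the kernel */
  static int lru_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  /* true if deactivation has nothing to move, so the LRU lock can be skipped */
  static bool can_skip_lru_lock(int folio_gen, unsigned long min_seq_of_type)
  {
          return folio_gen == lru_gen_from_seq(min_seq_of_type);
  }

  int main(void)
  {
          /* a folio in gen 1 whose type has min_seq 5 (5 % 4 == 1) is already oldest */
          printf("%d\n", can_skip_lru_lock(1, 5));        /* prints 1 */
          return 0;
  }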

A side effect is that a folio can be left at the head of the oldest
generation, rather than the tail. If reclaim happens at the same time,
it cannot reclaim this folio immediately. Since there is no known
correlation between truncation and reclaim, this side effect is
considered insignificant.

Reported-by: Bharata B Rao <bharata@amd.com>
Closes: https://lore.kernel.org/CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 mm/swap.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 3a01acfd5a89..649ef7f2b74b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -379,7 +379,8 @@ static void __lru_cache_activate_folio(struct folio *folio)
 }
 
 #ifdef CONFIG_LRU_GEN
-static void folio_inc_refs(struct folio *folio)
+
+static void lru_gen_inc_refs(struct folio *folio)
 {
 	unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
 
@@ -406,10 +407,34 @@ static void folio_inc_refs(struct folio *folio)
 		new_flags |= old_flags & ~LRU_REFS_MASK;
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 }
-#else
-static void folio_inc_refs(struct folio *folio)
+
+static bool lru_gen_clear_refs(struct folio *folio)
 {
+	struct lru_gen_folio *lrugen;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+
+	if (gen < 0)
+		return true;
+
+	set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
+
+	lrugen = &folio_lruvec(folio)->lrugen;
+	/* whether can do without shuffling under the LRU lock */
+	return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
 }
+
+#else /* !CONFIG_LRU_GEN */
+
+static void lru_gen_inc_refs(struct folio *folio)
+{
+}
+
+static bool lru_gen_clear_refs(struct folio *folio)
+{
+	return false;
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 /**
@@ -428,7 +453,7 @@ static void folio_inc_refs(struct folio *folio)
 void folio_mark_accessed(struct folio *folio)
 {
 	if (lru_gen_enabled()) {
-		folio_inc_refs(folio);
+		lru_gen_inc_refs(folio);
 		return;
 	}
 
@@ -524,7 +549,7 @@ void folio_add_lru_vma(struct folio *folio, struct vm_area_struct *vma)
  */
 static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio)
 {
-	bool active = folio_test_active(folio);
+	bool active = folio_test_active(folio) || lru_gen_enabled();
 	long nr_pages = folio_nr_pages(folio);
 
 	if (folio_test_unevictable(folio))
@@ -589,7 +614,10 @@ static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio)
 
 	lruvec_del_folio(lruvec, folio);
 	folio_clear_active(folio);
-	folio_clear_referenced(folio);
+	if (lru_gen_enabled())
+		lru_gen_clear_refs(folio);
+	else
+		folio_clear_referenced(folio);
 	/*
 	 * Lazyfree folios are clean anonymous folios.  They have
 	 * the swapbacked flag cleared, to distinguish them from normal
@@ -657,6 +685,9 @@ void deactivate_file_folio(struct folio *folio)
 	if (folio_test_unevictable(folio))
 		return;
 
+	if (lru_gen_enabled() && lru_gen_clear_refs(folio))
+		return;
+
 	folio_batch_add_and_move(folio, lru_deactivate_file, true);
 }
 
@@ -670,7 +701,10 @@ void deactivate_file_folio(struct folio *folio)
  */
 void folio_deactivate(struct folio *folio)
 {
-	if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_gen_enabled()))
+	if (folio_test_unevictable(folio))
+		return;
+
+	if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
 		return;
 
 	folio_batch_add_and_move(folio, lru_deactivate, true);
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 1/7] mm/mglru: clean up workingset Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 2/7] mm/mglru: optimize deactivation Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2025-01-07 17:14   ` Kairui Song
  2024-12-31  4:35 ` [PATCH mm-unstable v4 4/7] mm/mglru: rework type selection Yu Zhao
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yu Zhao, David Stevens, Kalesh Singh

The aging feedback is based on both the number of generations and the
distribution of folios in each generation. The number of generations
is currently the distance between max_seq and anon min_seq. This is
because anon min_seq is not allowed to move past file min_seq. The
rationale for that is that file is always evictable whereas anon is
not. However, for use cases where anon is a lot cheaper than file:
1. Anon in the second oldest generation can be a better choice than
   file in the oldest generation.
2. A large amount of file in the oldest generation can skew the
   distribution, making should_run_aging() return a false negative.

Allow anon and file min_seq to move independently, and use solely the
number of generations as the feedback for aging. Specifically, when
both anon and file are evictable, anon min_seq can now be greater than
file min_seq, and therefore the number of generations becomes the
distance between max_seq and min(min_seq[0],min_seq[1]). And
should_run_aging() returns true if and only if the number of
generations is less than MAX_NR_GENS.
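
For illustration only, a minimal userspace sketch of this feedback rule
(hypothetical, not the kernel implementation; the helpers below are
simplified stand-ins for evictable_min_seq() and should_run_aging(),
assuming MIN_NR_GENS=2, MAX_NR_GENS=4 and MAX_SWAPPINESS=200, with
index 0 for anon and 1 for file):

  #include <stdbool.h>
  #include <stdio.h>

  #define MIN_NR_GENS     2UL
  #define MAX_NR_GENS     4UL
  #define MAX_SWAPPINESS  200

  /* the oldest evictable sequence number, given the current swappiness */
  static unsigned long evictable_min_seq(const unsigned long min_seq[2], int swappiness)
  {
          unsigned long anon = min_seq[0], file = min_seq[1];

          if (!swappiness)                        /* anon is not evictable */
                  return file;
          if (swappiness == MAX_SWAPPINESS)       /* file is not evictable */
                  return anon;
          return anon < file ? anon : file;
  }

  /* run aging iff the number of evictable generations is below MAX_NR_GENS */
  static bool should_run_aging(unsigned long max_seq, const unsigned long min_seq[2],
                               int swappiness)
  {
          unsigned long nr_gens = max_seq - evictable_min_seq(min_seq, swappiness) + 1;

          return nr_gens < MAX_NR_GENS;
  }

  int main(void)
  {
          unsigned long min_seq[2] = { 7, 5 };    /* anon, file */

          printf("%d\n", should_run_aging(8, min_seq, 100));  /* 4 gens -> 0 */
          printf("%d\n", should_run_aging(8, min_seq, 200));  /* 2 gens -> 1 */
          return 0;
  }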

As the first step to the final optimization, this change by itself
should not have userspace-visible effects beyond performance. The next
two patches will take advantage of this change; the last patch in this
series will better distribute folios across MAX_NR_GENS.

Reported-by: David Stevens <stevensd@chromium.org>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 include/linux/mmzone.h |  17 ++--
 mm/vmscan.c            | 200 ++++++++++++++++++-----------------------
 2 files changed, 96 insertions(+), 121 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..8245ecb0400b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -421,12 +421,11 @@ enum {
 /*
  * The youngest generation number is stored in max_seq for both anon and file
  * types as they are aged on an equal footing. The oldest generation numbers are
- * stored in min_seq[] separately for anon and file types as clean file pages
- * can be evicted regardless of swap constraints.
- *
- * Normally anon and file min_seq are in sync. But if swapping is constrained,
- * e.g., out of swap space, file min_seq is allowed to advance and leave anon
- * min_seq behind.
+ * stored in min_seq[] separately for anon and file types so that they can be
+ * incremented independently. Ideally min_seq[] are kept in sync when both anon
+ * and file types are evictable. However, to adapt to situations like extreme
+ * swappiness, they are allowed to be out of sync by at most
+ * MAX_NR_GENS-MIN_NR_GENS-1.
  *
  * The number of pages in each generation is eventually consistent and therefore
  * can be transiently negative when reset_batch_size() is pending.
@@ -446,8 +445,8 @@ struct lru_gen_folio {
 	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
 	/* the exponential moving average of evicted+protected */
 	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
-	/* the first tier doesn't need protection, hence the minus one */
-	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can only be modified under the LRU lock */
+	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* can be modified without holding the LRU lock */
 	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
@@ -498,7 +497,7 @@ struct lru_gen_mm_walk {
 	int mm_stats[NR_MM_STATS];
 	/* total batched items */
 	int batched;
-	bool can_swap;
+	int swappiness;
 	bool force_scan;
 };
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f236db86de8a..f767e3d34e73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2627,11 +2627,17 @@ static bool should_clear_pmd_young(void)
 		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),	\
 	}
 
+#define evictable_min_seq(min_seq, swappiness)				\
+	min((min_seq)[!(swappiness)], (min_seq)[(swappiness) != MAX_SWAPPINESS])
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+#define for_each_evictable_type(type, swappiness)			\
+	for ((type) = !(swappiness); (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)
+
 #define get_memcg_gen(seq)	((seq) % MEMCG_NR_GENS)
 #define get_memcg_bin(bin)	((bin) % MEMCG_NR_BINS)
 
@@ -2677,10 +2683,16 @@ static int get_nr_gens(struct lruvec *lruvec, int type)
 
 static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
 {
-	/* see the comment on lru_gen_folio */
-	return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
-	       get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
-	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
+	int type;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		int n = get_nr_gens(lruvec, type);
+
+		if (n < MIN_NR_GENS || n > MAX_NR_GENS)
+			return false;
+	}
+
+	return true;
 }
 
 /******************************************************************************
@@ -3087,9 +3099,8 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
 	pos->refaulted = lrugen->avg_refaulted[type][tier] +
 			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
 	pos->total = lrugen->avg_total[type][tier] +
+		     lrugen->protected[hist][type][tier] +
 		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
-	if (tier)
-		pos->total += lrugen->protected[hist][type][tier - 1];
 	pos->gain = gain;
 }
 
@@ -3116,17 +3127,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
 			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
 
 			sum = lrugen->avg_total[type][tier] +
+			      lrugen->protected[hist][type][tier] +
 			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
-			if (tier)
-				sum += lrugen->protected[hist][type][tier - 1];
 			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
 		}
 
 		if (clear) {
 			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
 			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
-			if (tier)
-				WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
+			WRITE_ONCE(lrugen->protected[hist][type][tier], 0);
 		}
 	}
 }
@@ -3261,7 +3270,7 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
 		return true;
 
 	if (vma_is_anonymous(vma))
-		return !walk->can_swap;
+		return !walk->swappiness;
 
 	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
 		return true;
@@ -3271,7 +3280,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
 		return true;
 
 	if (shmem_mapping(mapping))
-		return !walk->can_swap;
+		return !walk->swappiness;
+
+	if (walk->swappiness == MAX_SWAPPINESS)
+		return true;
 
 	/* to exclude special mappings like dax, etc. */
 	return !mapping->a_ops->read_folio;
@@ -3359,7 +3371,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 }
 
 static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
-				   struct pglist_data *pgdat, bool can_swap)
+				   struct pglist_data *pgdat)
 {
 	struct folio *folio;
 
@@ -3370,10 +3382,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 	if (folio_memcg(folio) != memcg)
 		return NULL;
 
-	/* file VMAs can contain anon pages from COW */
-	if (!folio_is_file_lru(folio) && !can_swap)
-		return NULL;
-
 	return folio;
 }
 
@@ -3429,7 +3437,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (pfn == -1)
 			continue;
 
-		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
+		folio = get_pfn_folio(pfn, memcg, pgdat);
 		if (!folio)
 			continue;
 
@@ -3514,7 +3522,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 		if (pfn == -1)
 			goto next;
 
-		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
+		folio = get_pfn_folio(pfn, memcg, pgdat);
 		if (!folio)
 			goto next;
 
@@ -3726,22 +3734,26 @@ static void clear_mm_walk(void)
 		kfree(walk);
 }
 
-static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
+static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 {
 	int zone;
 	int remaining = MAX_LRU_BATCH;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
 
-	if (type == LRU_GEN_ANON && !can_swap)
+	if (type ? swappiness == MAX_SWAPPINESS : !swappiness)
 		goto done;
 
-	/* prevent cold/hot inversion if force_scan is true */
+	/* prevent cold/hot inversion if the type is evictable */
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		struct list_head *head = &lrugen->folios[old_gen][type][zone];
 
 		while (!list_empty(head)) {
 			struct folio *folio = lru_to_folio(head);
+			int refs = folio_lru_refs(folio);
+			int tier = lru_tier_from_refs(refs);
+			int delta = folio_nr_pages(folio);
 
 			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
@@ -3751,6 +3763,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 			new_gen = folio_inc_gen(lruvec, folio, false);
 			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
+			WRITE_ONCE(lrugen->protected[hist][type][tier],
+				   lrugen->protected[hist][type][tier] + delta);
+
 			if (!--remaining)
 				return false;
 		}
@@ -3762,7 +3777,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 	return true;
 }
 
-static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
 	bool success = false;
@@ -3772,7 +3787,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
 
 	/* find the oldest populated generation */
-	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+	for_each_evictable_type(type, swappiness) {
 		while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
 			gen = lru_gen_from_seq(min_seq[type]);
 
@@ -3788,13 +3803,17 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	}
 
 	/* see the comment on lru_gen_folio */
-	if (can_swap) {
-		min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
-		min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
+	if (swappiness && swappiness != MAX_SWAPPINESS) {
+		unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
+
+		if (min_seq[LRU_GEN_ANON] > seq && min_seq[LRU_GEN_FILE] < seq)
+			min_seq[LRU_GEN_ANON] = seq;
+		else if (min_seq[LRU_GEN_FILE] > seq && min_seq[LRU_GEN_ANON] < seq)
+			min_seq[LRU_GEN_FILE] = seq;
 	}
 
-	for (type = !can_swap; type < ANON_AND_FILE; type++) {
-		if (min_seq[type] == lrugen->min_seq[type])
+	for_each_evictable_type(type, swappiness) {
+		if (min_seq[type] <= lrugen->min_seq[type])
 			continue;
 
 		reset_ctrl_pos(lruvec, type, true);
@@ -3805,8 +3824,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	return success;
 }
 
-static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
-			bool can_swap, bool force_scan)
+static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
 {
 	bool success;
 	int prev, next;
@@ -3824,13 +3842,11 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 	if (!success)
 		goto unlock;
 
-	for (type = ANON_AND_FILE - 1; type >= 0; type--) {
+	for (type = 0; type < ANON_AND_FILE; type++) {
 		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
 			continue;
 
-		VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
-
-		if (inc_min_seq(lruvec, type, can_swap))
+		if (inc_min_seq(lruvec, type, swappiness))
 			continue;
 
 		spin_unlock_irq(&lruvec->lru_lock);
@@ -3874,7 +3890,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 }
 
 static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
-			       bool can_swap, bool force_scan)
+			       int swappiness, bool force_scan)
 {
 	bool success;
 	struct lru_gen_mm_walk *walk;
@@ -3885,7 +3901,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 	VM_WARN_ON_ONCE(seq > READ_ONCE(lrugen->max_seq));
 
 	if (!mm_state)
-		return inc_max_seq(lruvec, seq, can_swap, force_scan);
+		return inc_max_seq(lruvec, seq, swappiness);
 
 	/* see the comment in iterate_mm_list() */
 	if (seq <= READ_ONCE(mm_state->seq))
@@ -3910,7 +3926,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 
 	walk->lruvec = lruvec;
 	walk->seq = seq;
-	walk->can_swap = can_swap;
+	walk->swappiness = swappiness;
 	walk->force_scan = force_scan;
 
 	do {
@@ -3920,7 +3936,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 	} while (mm);
 done:
 	if (success) {
-		success = inc_max_seq(lruvec, seq, can_swap, force_scan);
+		success = inc_max_seq(lruvec, seq, swappiness);
 		WARN_ON_ONCE(!success);
 	}
 
@@ -3961,13 +3977,13 @@ static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
 {
 	int gen, type, zone;
 	unsigned long total = 0;
-	bool can_swap = get_swappiness(lruvec, sc);
+	int swappiness = get_swappiness(lruvec, sc);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
-	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+	for_each_evictable_type(type, swappiness) {
 		unsigned long seq;
 
 		for (seq = min_seq[type]; seq <= max_seq; seq++) {
@@ -3987,6 +4003,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
 {
 	int gen;
 	unsigned long birth;
+	int swappiness = get_swappiness(lruvec, sc);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
@@ -3996,8 +4013,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
 	if (!lruvec_is_sizable(lruvec, sc))
 		return false;
 
-	/* see the comment on lru_gen_folio */
-	gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
+	gen = lru_gen_from_seq(evictable_min_seq(min_seq, swappiness));
 	birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
 
 	return time_is_before_jiffies(birth + min_ttl);
@@ -4064,7 +4080,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	unsigned long addr = pvmw->address;
 	struct vm_area_struct *vma = pvmw->vma;
 	struct folio *folio = pfn_folio(pvmw->pfn);
-	bool can_swap = !folio_is_file_lru(folio);
 	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct pglist_data *pgdat = folio_pgdat(folio);
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -4117,7 +4132,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (pfn == -1)
 			continue;
 
-		folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
+		folio = get_pfn_folio(pfn, memcg, pgdat);
 		if (!folio)
 			continue;
 
@@ -4333,8 +4348,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		gen = folio_inc_gen(lruvec, folio, false);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 
-		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
-			   lrugen->protected[hist][type][tier - 1] + delta);
+		WRITE_ONCE(lrugen->protected[hist][type][tier],
+			   lrugen->protected[hist][type][tier] + delta);
 		return true;
 	}
 
@@ -4533,7 +4548,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 {
 	int i;
 	int type;
-	int scanned;
 	int tier = -1;
 	DEFINE_MIN_SEQ(lruvec);
 
@@ -4558,21 +4572,23 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	else
 		type = get_type_to_scan(lruvec, swappiness, &tier);
 
-	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+	for_each_evictable_type(i, swappiness) {
+		int scanned;
+
 		if (tier < 0)
 			tier = get_tier_idx(lruvec, type);
 
+		*type_scanned = type;
+
 		scanned = scan_folios(lruvec, sc, type, tier, list);
 		if (scanned)
-			break;
+			return scanned;
 
 		type = !type;
 		tier = -1;
 	}
 
-	*type_scanned = type;
-
-	return scanned;
+	return 0;
 }
 
 static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
@@ -4588,6 +4604,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
 	bool skip_retry = false;
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -4597,7 +4614,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	scanned += try_to_inc_min_seq(lruvec, swappiness);
 
-	if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS)
+	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
 		scanned = 0;
 
 	spin_unlock_irq(&lruvec->lru_lock);
@@ -4669,63 +4686,32 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 }
 
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
-			     bool can_swap, unsigned long *nr_to_scan)
+			     int swappiness, unsigned long *nr_to_scan)
 {
 	int gen, type, zone;
-	unsigned long old = 0;
-	unsigned long young = 0;
-	unsigned long total = 0;
+	unsigned long size = 0;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
 
-	/* whether this lruvec is completely out of cold folios */
-	if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) {
-		*nr_to_scan = 0;
+	*nr_to_scan = 0;
+	/* have to run aging, since eviction is not possible anymore */
+	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
-	}
 
-	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+	for_each_evictable_type(type, swappiness) {
 		unsigned long seq;
 
 		for (seq = min_seq[type]; seq <= max_seq; seq++) {
-			unsigned long size = 0;
-
 			gen = lru_gen_from_seq(seq);
 
 			for (zone = 0; zone < MAX_NR_ZONES; zone++)
 				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
-
-			total += size;
-			if (seq == max_seq)
-				young += size;
-			else if (seq + MIN_NR_GENS == max_seq)
-				old += size;
 		}
 	}
 
-	*nr_to_scan = total;
-
-	/*
-	 * The aging tries to be lazy to reduce the overhead, while the eviction
-	 * stalls when the number of generations reaches MIN_NR_GENS. Hence, the
-	 * ideal number of generations is MIN_NR_GENS+1.
-	 */
-	if (min_seq[!can_swap] + MIN_NR_GENS < max_seq)
-		return false;
-
-	/*
-	 * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1)
-	 * of the total number of pages for each generation. A reasonable range
-	 * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
-	 * aging cares about the upper bound of hot pages, while the eviction
-	 * cares about the lower bound of cold pages.
-	 */
-	if (young * MIN_NR_GENS > total)
-		return true;
-	if (old * (MIN_NR_GENS + 2) < total)
-		return true;
-
-	return false;
+	*nr_to_scan = size;
+	/* better to run aging even though eviction is still possible */
+	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
 
 /*
@@ -4733,7 +4719,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
  * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
  *    reclaim.
  */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 {
 	bool success;
 	unsigned long nr_to_scan;
@@ -4743,7 +4729,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
 		return -1;
 
-	success = should_run_aging(lruvec, max_seq, can_swap, &nr_to_scan);
+	success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
 
 	/* try to scrape all its memory if this memcg was deleted */
 	if (nr_to_scan && !mem_cgroup_online(memcg))
@@ -4754,7 +4740,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 		return nr_to_scan >> sc->priority;
 
 	/* stop scanning this lruvec as it's low on cold folios */
-	return try_to_inc_max_seq(lruvec, max_seq, can_swap, false) ? -1 : 0;
+	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
 }
 
 static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -5298,8 +5284,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
 				s = "rep";
 				n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
 				n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
-				if (tier)
-					n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
+				n[2] = READ_ONCE(lrugen->protected[hist][type][tier]);
 			}
 
 			for (i = 0; i < 3; i++)
@@ -5354,7 +5339,7 @@ static int lru_gen_seq_show(struct seq_file *m, void *v)
 	seq_printf(m, " node %5d\n", nid);
 
 	if (!full)
-		seq = min_seq[LRU_GEN_ANON];
+		seq = evictable_min_seq(min_seq, MAX_SWAPPINESS / 2);
 	else if (max_seq >= MAX_NR_GENS)
 		seq = max_seq - MAX_NR_GENS + 1;
 	else
@@ -5394,23 +5379,14 @@ static const struct seq_operations lru_gen_seq_ops = {
 };
 
 static int run_aging(struct lruvec *lruvec, unsigned long seq,
-		     bool can_swap, bool force_scan)
+		     int swappiness, bool force_scan)
 {
 	DEFINE_MAX_SEQ(lruvec);
-	DEFINE_MIN_SEQ(lruvec);
-
-	if (seq < max_seq)
-		return 0;
 
 	if (seq > max_seq)
 		return -EINVAL;
 
-	if (!force_scan && min_seq[!can_swap] + MAX_NR_GENS - 1 <= max_seq)
-		return -ERANGE;
-
-	try_to_inc_max_seq(lruvec, max_seq, can_swap, force_scan);
-
-	return 0;
+	return try_to_inc_max_seq(lruvec, max_seq, swappiness, force_scan) ? 0 : -EEXIST;
 }
 
 static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
@@ -5426,7 +5402,7 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
 	while (!signal_pending(current)) {
 		DEFINE_MIN_SEQ(lruvec);
 
-		if (seq < min_seq[!swappiness])
+		if (seq < evictable_min_seq(min_seq, swappiness))
 			return 0;
 
 		if (sc->nr_reclaimed >= nr_to_reclaim)
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH mm-unstable v4 4/7] mm/mglru: rework type selection
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
                   ` (2 preceding siblings ...)
  2024-12-31  4:35 ` [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 5/7] mm/mglru: rework refault detection Yu Zhao
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yu Zhao, David Stevens, Kalesh Singh

With anon and file min_seq being able to move independently, rework
type selection so that it is based on the total refaults from all
tiers of each type. Also allow a type to be selected until that type
reaches MIN_NR_GENS, regardless of whether that type has a larger
min_seq or not, to accommodate extreme swappiness.

Since some tiers of a selected type can have higher refaults than the
first tier of the other type, use a less skewed gain factor, 2:3
instead of 1:2, so that those tiers of the selected type are better
protected.
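
For illustration only, a simplified userspace model of the reworked
selection (hypothetical, not the kernel code; it drops the exponential
moving averages, the MIN_LRU_BATCH padding and the per-tier protected
counts, but keeps the idea of comparing gain-weighted refault ratios
summed over all tiers):

  #include <stdio.h>

  #define MAX_NR_TIERS    4
  #define MAX_SWAPPINESS  200

  enum { LRU_GEN_ANON, LRU_GEN_FILE };

  struct type_stats {
          unsigned long refaulted[MAX_NR_TIERS];
          unsigned long total[MAX_NR_TIERS];      /* evicted + protected */
  };

  static int get_type_to_scan(const struct type_stats s[2], int swappiness)
  {
          unsigned long refaulted[2] = { 0, 0 }, total[2] = { 0, 0 };
          unsigned long gain[2] = { swappiness, MAX_SWAPPINESS - swappiness };

          if (!swappiness)
                  return LRU_GEN_FILE;
          if (swappiness == MAX_SWAPPINESS)
                  return LRU_GEN_ANON;

          for (int type = 0; type < 2; type++) {
                  for (int tier = 0; tier < MAX_NR_TIERS; tier++) {
                          refaulted[type] += s[type].refaulted[tier];
                          total[type] += s[type].total[tier];
                  }
          }
          /*
           * Scan the type whose refault ratio, normalized by its gain, is
           * lower; the comparison is cross-multiplied to avoid division.
           */
          if (refaulted[LRU_GEN_FILE] * total[LRU_GEN_ANON] * gain[LRU_GEN_ANON] <=
              refaulted[LRU_GEN_ANON] * total[LRU_GEN_FILE] * gain[LRU_GEN_FILE])
                  return LRU_GEN_FILE;

          return LRU_GEN_ANON;
  }

  int main(void)
  {
          struct type_stats s[2] = {
                  [LRU_GEN_ANON] = { .refaulted = { 10 }, .total = { 100 } },
                  [LRU_GEN_FILE] = { .refaulted = { 40, 20 }, .total = { 100, 50 } },
          };

          /* file refaults more heavily overall, so anon is scanned */
          printf("%d\n", get_type_to_scan(s, 100));       /* 0 == LRU_GEN_ANON */
          return 0;
  }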

As an intermediate step to the final optimization, this change by
itself should not have userspace-visible effects beyond performance.

Reported-by: David Stevens <stevensd@chromium.org>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 mm/vmscan.c | 82 +++++++++++++++++------------------------------------
 1 file changed, 26 insertions(+), 56 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f767e3d34e73..a33221298fd0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3093,15 +3093,20 @@ struct ctrl_pos {
 static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
 			  struct ctrl_pos *pos)
 {
+	int i;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
-	pos->refaulted = lrugen->avg_refaulted[type][tier] +
-			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-	pos->total = lrugen->avg_total[type][tier] +
-		     lrugen->protected[hist][type][tier] +
-		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
 	pos->gain = gain;
+	pos->refaulted = pos->total = 0;
+
+	for (i = tier % MAX_NR_TIERS; i <= min(tier, MAX_NR_TIERS - 1); i++) {
+		pos->refaulted += lrugen->avg_refaulted[type][i] +
+				  atomic_long_read(&lrugen->refaulted[hist][type][i]);
+		pos->total += lrugen->avg_total[type][i] +
+			      lrugen->protected[hist][type][i] +
+			      atomic_long_read(&lrugen->evicted[hist][type][i]);
+	}
 }
 
 static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
@@ -4501,13 +4506,13 @@ static int get_tier_idx(struct lruvec *lruvec, int type)
 	struct ctrl_pos sp, pv;
 
 	/*
-	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * To leave a margin for fluctuations, use a larger gain factor (2:3).
 	 * This value is chosen because any other tier would have at least twice
 	 * as many refaults as the first tier.
 	 */
-	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	read_ctrl_pos(lruvec, type, 0, 2, &sp);
 	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
-		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		read_ctrl_pos(lruvec, type, tier, 3, &pv);
 		if (!positive_ctrl_err(&sp, &pv))
 			break;
 	}
@@ -4515,68 +4520,34 @@ static int get_tier_idx(struct lruvec *lruvec, int type)
 	return tier - 1;
 }
 
-static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
 {
-	int type, tier;
 	struct ctrl_pos sp, pv;
-	int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };
 
+	if (!swappiness)
+		return LRU_GEN_FILE;
+
+	if (swappiness == MAX_SWAPPINESS)
+		return LRU_GEN_ANON;
 	/*
-	 * Compare the first tier of anon with that of file to determine which
-	 * type to scan. Also need to compare other tiers of the selected type
-	 * with the first tier of the other type to determine the last tier (of
-	 * the selected type) to evict.
+	 * Compare the sum of all tiers of anon with that of file to determine
+	 * which type to scan.
 	 */
-	read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
-	read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
-	type = positive_ctrl_err(&sp, &pv);
+	read_ctrl_pos(lruvec, LRU_GEN_ANON, MAX_NR_TIERS, swappiness, &sp);
+	read_ctrl_pos(lruvec, LRU_GEN_FILE, MAX_NR_TIERS, MAX_SWAPPINESS - swappiness, &pv);
 
-	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
-	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
-		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
-		if (!positive_ctrl_err(&sp, &pv))
-			break;
-	}
-
-	*tier_idx = tier - 1;
-
-	return type;
+	return positive_ctrl_err(&sp, &pv);
 }
 
 static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
 			  int *type_scanned, struct list_head *list)
 {
 	int i;
-	int type;
-	int tier = -1;
-	DEFINE_MIN_SEQ(lruvec);
-
-	/*
-	 * Try to make the obvious choice first, and if anon and file are both
-	 * available from the same generation,
-	 * 1. Interpret swappiness 1 as file first and MAX_SWAPPINESS as anon
-	 *    first.
-	 * 2. If !__GFP_IO, file first since clean pagecache is more likely to
-	 *    exist than clean swapcache.
-	 */
-	if (!swappiness)
-		type = LRU_GEN_FILE;
-	else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
-		type = LRU_GEN_ANON;
-	else if (swappiness == 1)
-		type = LRU_GEN_FILE;
-	else if (swappiness == MAX_SWAPPINESS)
-		type = LRU_GEN_ANON;
-	else if (!(sc->gfp_mask & __GFP_IO))
-		type = LRU_GEN_FILE;
-	else
-		type = get_type_to_scan(lruvec, swappiness, &tier);
+	int type = get_type_to_scan(lruvec, swappiness);
 
 	for_each_evictable_type(i, swappiness) {
 		int scanned;
-
-		if (tier < 0)
-			tier = get_tier_idx(lruvec, type);
+		int tier = get_tier_idx(lruvec, type);
 
 		*type_scanned = type;
 
@@ -4585,7 +4556,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 			return scanned;
 
 		type = !type;
-		tier = -1;
 	}
 
 	return 0;
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH mm-unstable v4 5/7] mm/mglru: rework refault detection
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
                   ` (3 preceding siblings ...)
  2024-12-31  4:35 ` [PATCH mm-unstable v4 4/7] mm/mglru: rework type selection Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 6/7] mm/mglru: rework workingset protection Yu Zhao
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao, Kairui Song, Kalesh Singh

With anon and file min_seq being able to move independently, rework
workingset protection as well so that the comparison of refaults
between anon and file is always on an equal footing.

Specifically, make lru_gen_test_recent() return true for refaults
happening within the distance of MAX_NR_GENS. For example, if min_seq
of a type is max_seq-MIN_NR_GENS, refaults from min_seq-1, i.e.,
max_seq-MIN_NR_GENS-1, are also considered recent, since the distance
max_seq-(max_seq-MIN_NR_GENS-1), i.e., MIN_NR_GENS+1, is less than
MAX_NR_GENS.
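
For illustration only, a small userspace sketch of this distance check
(hypothetical, not the kernel code; the shadow-entry packing and the
EVICTION_MASK truncation are omitted, and `token` stands for the
sequence number recorded at eviction time):

  #include <stdbool.h>
  #include <stdio.h>

  #define MIN_NR_GENS     2UL
  #define MAX_NR_GENS     4UL

  static unsigned long abs_diff(unsigned long a, unsigned long b)
  {
          return a > b ? a - b : b - a;
  }

  /* a refault is "recent" if it happened within MAX_NR_GENS of max_seq */
  static bool test_recent(unsigned long max_seq, unsigned long token)
  {
          return abs_diff(max_seq, token) < MAX_NR_GENS;
  }

  int main(void)
  {
          unsigned long max_seq = 10;
          unsigned long min_seq = max_seq - MIN_NR_GENS;          /* 8 */

          /* evicted at min_seq-1 = 7: distance 3 = MIN_NR_GENS+1, still recent */
          printf("%d\n", test_recent(max_seq, min_seq - 1));      /* 1 */
          /* evicted at max_seq-MAX_NR_GENS = 6: distance 4, no longer recent */
          printf("%d\n", test_recent(max_seq, max_seq - MAX_NR_GENS));  /* 0 */
          return 0;
  }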

As an intermediate step to the final optimization, this change by
itself should not have userspace-visible effects beyond performance.

Reported-by: Kairui Song <kasong@tencent.com>
Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 mm/workingset.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index ad181d1b8cf1..2c310c29f51e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -260,11 +260,11 @@ static void *lru_gen_eviction(struct folio *folio)
  * Tests if the shadow entry is for a folio that was recently evicted.
  * Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
  */
-static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
+static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
 				unsigned long *token, bool *workingset)
 {
 	int memcg_id;
-	unsigned long min_seq;
+	unsigned long max_seq;
 	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat;
 
@@ -273,8 +273,10 @@ static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
 	memcg = mem_cgroup_from_id(memcg_id);
 	*lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-	min_seq = READ_ONCE((*lruvec)->lrugen.min_seq[file]);
-	return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
+	max_seq = READ_ONCE((*lruvec)->lrugen.max_seq);
+	max_seq &= EVICTION_MASK >> LRU_REFS_WIDTH;
+
+	return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
 }
 
 static void lru_gen_refault(struct folio *folio, void *shadow)
@@ -290,7 +292,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 	rcu_read_lock();
 
-	recent = lru_gen_test_recent(shadow, type, &lruvec, &token, &workingset);
+	recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset);
 	if (lruvec != folio_lruvec(folio))
 		goto unlock;
 
@@ -331,7 +333,7 @@ static void *lru_gen_eviction(struct folio *folio)
 	return NULL;
 }
 
-static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
+static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
 				unsigned long *token, bool *workingset)
 {
 	return false;
@@ -432,8 +434,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 		bool recent;
 
 		rcu_read_lock();
-		recent = lru_gen_test_recent(shadow, file, &eviction_lruvec,
-					     &eviction, workingset);
+		recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction, workingset);
 		rcu_read_unlock();
 		return recent;
 	}
-- 
2.47.1.613.gc27f4b7a9f-goog




* [PATCH mm-unstable v4 6/7] mm/mglru: rework workingset protection
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
                   ` (4 preceding siblings ...)
  2024-12-31  4:35 ` [PATCH mm-unstable v4 5/7] mm/mglru: rework refault detection Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2024-12-31  4:35 ` [PATCH mm-unstable v4 7/7] mm/mglru: fix PTE-mapped large folios Yu Zhao
  2025-01-03  0:03 ` [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Andrew Morton
  7 siblings, 0 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao, Kairui Song, Kalesh Singh

With the aging feedback no longer considering the distribution of
folios in each generation, rework workingset protection to better
distribute folios across MAX_NR_GENS. This is achieved by reusing
PG_workingset and PG_referenced/LRU_REFS_FLAGS in a slightly different
way.

For folios accessed multiple times through file descriptors, make
lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in
folio->flags after PG_referenced, then PG_workingset after
LRU_REFS_WIDTH. After all its bits are set, i.e.,
LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily promoted into the
second oldest generation in the eviction path. And when
folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is
only valid when PG_referenced is set.

For folios accessed multiple times through page tables,
folio_update_gen() from a page table walk or lru_gen_set_refs() from a
rmap walk sets PG_referenced after the accessed bit is cleared for the
first time. Thereafter, those two paths set PG_workingset and promote
folios to the youngest generation. Like folio_inc_gen(), when
folio_update_gen() does that, it also clears PG_referenced. For this
case, LRU_REFS_MASK is not used.

For both of the cases, after PG_workingset is set on a folio, it
remains until this folio is either reclaimed, or "deactivated" by
lru_gen_clear_refs(). It can be set again if lru_gen_test_recent()
returns true upon a refault.
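
For illustration only, a minimal userspace sketch of the
file-descriptor path described above (hypothetical, not the kernel
code; the flags are modeled as plain fields rather than folio->flags
bits, and LRU_REFS_WIDTH is fixed at 2 here although the real width is
configuration-dependent):

  #include <stdbool.h>
  #include <stdio.h>

  #define LRU_REFS_WIDTH  2                       /* illustrative width */
  #define LRU_REFS_MAX    ((1 << LRU_REFS_WIDTH) - 1)

  struct folio_state {
          bool referenced;        /* PG_referenced */
          bool workingset;        /* PG_workingset */
          int refs;               /* the LRU_REFS_MASK counter */
  };

  /* one access through a file descriptor, as described for lru_gen_inc_refs() */
  static void inc_refs(struct folio_state *f)
  {
          if (!f->referenced) {
                  f->referenced = true;   /* first access */
                  return;
          }
          if (f->refs < LRU_REFS_MAX) {
                  f->refs++;              /* subsequent accesses */
                  return;
          }
          /* counter saturated: lazily promoted in the eviction path */
          f->workingset = true;
  }

  int main(void)
  {
          struct folio_state f = { 0 };

          for (int i = 0; i < 6; i++)
                  inc_refs(&f);
          printf("referenced=%d refs=%d workingset=%d\n",
                 f.referenced, f.refs, f.workingset);
          return 0;
  }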

When adding folios to the LRU lists, lru_gen_folio_seq() distributes
them as follows:
+---------------------------------+---------------------------------+
|    Accessed thru page tables    | Accessed thru file descriptors  |
+---------------------------------+---------------------------------+
| PG_active (set while isolated)  |                                 |
+----------------+----------------+----------------+----------------+
| PG_workingset  | PG_referenced  | PG_workingset  | LRU_REFS_FLAGS |
+---------------------------------+---------------------------------+
|<--------- MIN_NR_GENS --------->|                                 |
|<-------------------------- MAX_NR_GENS -------------------------->|

After this patch, some typical client and server workloads showed
improvements under heavy memory pressure. For example, Python TPC-C,
which was used to benchmark a different approach [1] to better detect
refault distances, showed a significant decrease in total refaults:
                            Before      After      Change
  Time (seconds)            10801       10801      0%
  Executed (transactions)   41472       43663      +5%
  workingset_nodes          109070      120244     +10%
  workingset_refault_anon   5019627     7281831    +45%
  workingset_refault_file   1294678786  554855564  -57%
  workingset_refault_total  1299698413  562137395  -57%

[1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@gmail.com/

Reported-by: Kairui Song <kasong@tencent.com>
Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 include/linux/mm_inline.h |  88 +++++++++++------------
 include/linux/mmzone.h    |  82 +++++++++++++--------
 mm/swap.c                 |  24 +++----
 mm/vmscan.c               | 147 ++++++++++++++++++++++----------------
 mm/workingset.c           |  29 ++++----
 5 files changed, 204 insertions(+), 166 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 34e5097182a0..f9157a0c42a5 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -133,31 +133,25 @@ static inline int lru_hist_from_seq(unsigned long seq)
 	return seq % NR_HIST_GENS;
 }
 
-static inline int lru_tier_from_refs(int refs)
+static inline int lru_tier_from_refs(int refs, bool workingset)
 {
 	VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
 
-	/* see the comment in folio_lru_refs() */
-	return order_base_2(refs + 1);
+	/* see the comment on MAX_NR_TIERS */
+	return workingset ? MAX_NR_TIERS - 1 : order_base_2(refs);
 }
 
 static inline int folio_lru_refs(struct folio *folio)
 {
 	unsigned long flags = READ_ONCE(folio->flags);
-	bool workingset = flags & BIT(PG_workingset);
 
+	if (!(flags & BIT(PG_referenced)))
+		return 0;
 	/*
-	 * Return the number of accesses beyond PG_referenced, i.e., N-1 if the
-	 * total number of accesses is N>1, since N=0,1 both map to the first
-	 * tier. lru_tier_from_refs() will account for this off-by-one. Also see
-	 * the comment on MAX_NR_TIERS.
+	 * Return the total number of accesses including PG_referenced. Also see
+	 * the comment on LRU_REFS_FLAGS.
 	 */
-	return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
-}
-
-static inline void folio_clear_lru_refs(struct folio *folio)
-{
-	set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
+	return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1;
 }
 
 static inline int folio_lru_gen(struct folio *folio)
@@ -223,11 +217,43 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 	VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
+static inline unsigned long lru_gen_folio_seq(struct lruvec *lruvec, struct folio *folio,
+					      bool reclaiming)
+{
+	int gen;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+	/*
+	 * +-----------------------------------+-----------------------------------+
+	 * | Accessed through page tables and  | Accessed through file descriptors |
+	 * | promoted by folio_update_gen()    | and protected by folio_inc_gen()  |
+	 * +-----------------------------------+-----------------------------------+
+	 * | PG_active (set while isolated)    |                                   |
+	 * +-----------------+-----------------+-----------------+-----------------+
+	 * |  PG_workingset  |  PG_referenced  |  PG_workingset  |  LRU_REFS_FLAGS |
+	 * +-----------------------------------+-----------------------------------+
+	 * |<---------- MIN_NR_GENS ---------->|                                   |
+	 * |<---------------------------- MAX_NR_GENS ---------------------------->|
+	 */
+	if (folio_test_active(folio))
+		gen = MIN_NR_GENS - folio_test_workingset(folio);
+	else if (reclaiming)
+		gen = MAX_NR_GENS;
+	else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
+		 (folio_test_reclaim(folio) &&
+		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+		gen = MIN_NR_GENS;
+	else
+		gen = MAX_NR_GENS - folio_test_workingset(folio);
+
+	return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
+}
+
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	unsigned long seq;
 	unsigned long flags;
-	unsigned long mask;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
@@ -237,40 +263,12 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
 
 	if (folio_test_unevictable(folio) || !lrugen->enabled)
 		return false;
-	/*
-	 * There are four common cases for this page:
-	 * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
-	 *    generation, and it's protected over the rest below.
-	 * 2. If it can't be evicted immediately, i.e., a dirty page pending
-	 *    writeback, add it to the second youngest generation.
-	 * 3. If it should be evicted first, e.g., cold and clean from
-	 *    folio_rotate_reclaimable(), add it to the oldest generation.
-	 * 4. Everything else falls between 2 & 3 above and is added to the
-	 *    second oldest generation if it's considered inactive, or the
-	 *    oldest generation otherwise. See lru_gen_is_active().
-	 */
-	if (folio_test_active(folio))
-		seq = lrugen->max_seq;
-	else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
-		 (folio_test_reclaim(folio) &&
-		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
-		seq = lrugen->max_seq - 1;
-	else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
-		seq = lrugen->min_seq[type];
-	else
-		seq = lrugen->min_seq[type] + 1;
 
+	seq = lru_gen_folio_seq(lruvec, folio, reclaiming);
 	gen = lru_gen_from_seq(seq);
 	flags = (gen + 1UL) << LRU_GEN_PGOFF;
 	/* see the comment on MIN_NR_GENS about PG_active */
-	mask = LRU_GEN_MASK;
-	/*
-	 * Don't clear PG_workingset here because it can affect PSI accounting
-	 * if the activation is due to workingset refault.
-	 */
-	if (folio_test_active(folio))
-		mask |= LRU_REFS_MASK | BIT(PG_referenced) | BIT(PG_active);
-	set_mask_bits(&folio->flags, mask, flags);
+	set_mask_bits(&folio->flags, LRU_GEN_MASK | BIT(PG_active), flags);
 
 	lru_gen_update_size(lruvec, folio, -1, gen);
 	/* for folio_rotate_reclaimable() */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8245ecb0400b..9540b41894da 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -332,66 +332,88 @@ enum lruvec_flags {
 #endif /* !__GENERATING_BOUNDS_H */
 
 /*
- * Evictable pages are divided into multiple generations. The youngest and the
+ * Evictable folios are divided into multiple generations. The youngest and the
  * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
  * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
  * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
  * corresponding generation. The gen counter in folio->flags stores gen+1 while
- * a page is on one of lrugen->folios[]. Otherwise it stores 0.
+ * a folio is on one of lrugen->folios[]. Otherwise it stores 0.
  *
- * A page is added to the youngest generation on faulting. The aging needs to
- * check the accessed bit at least twice before handing this page over to the
- * eviction. The first check takes care of the accessed bit set on the initial
- * fault; the second check makes sure this page hasn't been used since then.
- * This process, AKA second chance, requires a minimum of two generations,
- * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
- * LRU, e.g., /proc/vmstat, these two generations are considered active; the
- * rest of generations, if they exist, are considered inactive. See
- * lru_gen_is_active().
+ * After a folio is faulted in, the aging needs to check the accessed bit at
+ * least twice before handing this folio over to the eviction. The first check
+ * clears the accessed bit from the initial fault; the second check makes sure
+ * this folio hasn't been used since then. This process, AKA second chance,
+ * requires a minimum of two generations, hence MIN_NR_GENS. And to maintain ABI
+ * compatibility with the active/inactive LRU, e.g., /proc/vmstat, these two
+ * generations are considered active; the rest of generations, if they exist,
+ * are considered inactive. See lru_gen_is_active().
  *
- * PG_active is always cleared while a page is on one of lrugen->folios[] so
- * that the aging needs not to worry about it. And it's set again when a page
- * considered active is isolated for non-reclaiming purposes, e.g., migration.
- * See lru_gen_add_folio() and lru_gen_del_folio().
+ * PG_active is always cleared while a folio is on one of lrugen->folios[] so
+ * that the sliding window needs not to worry about it. And it's set again when
+ * a folio considered active is isolated for non-reclaiming purposes, e.g.,
+ * migration. See lru_gen_add_folio() and lru_gen_del_folio().
  *
  * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
  * number of categories of the active/inactive LRU when keeping track of
  * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
- * in folio->flags.
+ * in folio->flags, masked by LRU_GEN_MASK.
  */
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		4U
 
 /*
- * Each generation is divided into multiple tiers. A page accessed N times
- * through file descriptors is in tier order_base_2(N). A page in the first tier
- * (N=0,1) is marked by PG_referenced unless it was faulted in through page
- * tables or read ahead. A page in any other tier (N>1) is marked by
- * PG_referenced and PG_workingset. This implies a minimum of two tiers is
- * supported without using additional bits in folio->flags.
+ * Each generation is divided into multiple tiers. A folio accessed N times
+ * through file descriptors is in tier order_base_2(N). A folio in the first
+ * tier (N=0,1) is marked by PG_referenced unless it was faulted in through page
+ * tables or read ahead. A folio in the last tier (MAX_NR_TIERS-1) is marked by
+ * PG_workingset. A folio in any other tier (1<N<5) between the first and last
+ * is marked by additional bits of LRU_REFS_WIDTH in folio->flags.
  *
  * In contrast to moving across generations which requires the LRU lock, moving
  * across tiers only involves atomic operations on folio->flags and therefore
  * has a negligible cost in the buffered access path. In the eviction path,
- * comparisons of refaulted/(evicted+protected) from the first tier and the
- * rest infer whether pages accessed multiple times through file descriptors
- * are statistically hot and thus worth protecting.
+ * comparisons of refaulted/(evicted+protected) from the first tier and the rest
+ * infer whether folios accessed multiple times through file descriptors are
+ * statistically hot and thus worth protecting.
  *
  * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
  * number of categories of the active/inactive LRU when keeping track of
  * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in
- * folio->flags.
+ * folio->flags, masked by LRU_REFS_MASK.
  */
 #define MAX_NR_TIERS		4U
 
 #ifndef __GENERATING_BOUNDS_H
 
-struct lruvec;
-struct page_vma_mapped_walk;
-
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
 
+/*
+ * For folios accessed multiple times through file descriptors,
+ * lru_gen_inc_refs() sets additional bits of LRU_REFS_WIDTH in folio->flags
+ * after PG_referenced, then PG_workingset after LRU_REFS_WIDTH. After all its
+ * bits are set, i.e., LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily
+ * promoted into the second oldest generation in the eviction path. And when
+ * folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
+ * lru_gen_inc_refs() can start over. Note that for this case, LRU_REFS_MASK is
+ * only valid when PG_referenced is set.
+ *
+ * For folios accessed multiple times through page tables, folio_update_gen()
+ * from a page table walk or lru_gen_set_refs() from a rmap walk sets
+ * PG_referenced after the accessed bit is cleared for the first time.
+ * Thereafter, those two paths set PG_workingset and promote folios to the
+ * youngest generation. Like folio_inc_gen(), folio_update_gen() also clears
+ * PG_referenced. Note that for this case, LRU_REFS_MASK is not used.
+ *
+ * For both cases above, after PG_workingset is set on a folio, it remains until
+ * this folio is either reclaimed, or "deactivated" by lru_gen_clear_refs(). It
+ * can be set again if lru_gen_test_recent() returns true upon a refault.
+ */
+#define LRU_REFS_FLAGS		(LRU_REFS_MASK | BIT(PG_referenced))
+
+struct lruvec;
+struct page_vma_mapped_walk;
+
 #ifdef CONFIG_LRU_GEN
 
 enum {
@@ -406,8 +428,6 @@ enum {
 	NR_LRU_GEN_CAPS
 };
 
-#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
-
 #define MIN_LRU_BATCH		BITS_PER_LONG
 #define MAX_LRU_BATCH		(MIN_LRU_BATCH * 64)
 
diff --git a/mm/swap.c b/mm/swap.c
index 649ef7f2b74b..746a5ceba42c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -387,24 +387,20 @@ static void lru_gen_inc_refs(struct folio *folio)
 	if (folio_test_unevictable(folio))
 		return;
 
+	/* see the comment on LRU_REFS_FLAGS */
 	if (!folio_test_referenced(folio)) {
-		folio_set_referenced(folio);
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, BIT(PG_referenced));
 		return;
 	}
 
-	if (!folio_test_workingset(folio)) {
-		folio_set_workingset(folio);
-		return;
-	}
-
-	/* see the comment on MAX_NR_TIERS */
 	do {
-		new_flags = old_flags & LRU_REFS_MASK;
-		if (new_flags == LRU_REFS_MASK)
-			break;
+		if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
+			if (!folio_test_workingset(folio))
+				folio_set_workingset(folio);
+			return;
+		}
 
-		new_flags += BIT(LRU_REFS_PGOFF);
-		new_flags |= old_flags & ~LRU_REFS_MASK;
+		new_flags = old_flags + BIT(LRU_REFS_PGOFF);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 }
 
@@ -417,7 +413,7 @@ static bool lru_gen_clear_refs(struct folio *folio)
 	if (gen < 0)
 		return true;
 
-	set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
+	set_mask_bits(&folio->flags, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
 
 	lrugen = &folio_lruvec(folio)->lrugen;
 	/* whether can do without shuffling under the LRU lock */
@@ -499,7 +495,7 @@ void folio_add_lru(struct folio *folio)
 			folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
-	/* see the comment in lru_gen_add_folio() */
+	/* see the comment in lru_gen_folio_seq() */
 	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
 	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
 		folio_set_active(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a33221298fd0..74bc85fc7cdf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -862,6 +862,31 @@ enum folio_references {
 	FOLIOREF_ACTIVATE,
 };
 
+#ifdef CONFIG_LRU_GEN
+/*
+ * Only used on a mapped folio in the eviction (rmap walk) path, where promotion
+ * needs to be done by taking the folio off the LRU list and then adding it back
+ * with PG_active set. In contrast, the aging (page table walk) path uses
+ * folio_update_gen().
+ */
+static bool lru_gen_set_refs(struct folio *folio)
+{
+	/* see the comment on LRU_REFS_FLAGS */
+	if (!folio_test_referenced(folio) && !folio_test_workingset(folio)) {
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, BIT(PG_referenced));
+		return false;
+	}
+
+	set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_workingset));
+	return true;
+}
+#else
+static bool lru_gen_set_refs(struct folio *folio)
+{
+	return false;
+}
+#endif /* CONFIG_LRU_GEN */
+
 static enum folio_references folio_check_references(struct folio *folio,
 						  struct scan_control *sc)
 {
@@ -870,7 +895,6 @@ static enum folio_references folio_check_references(struct folio *folio,
 
 	referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup,
 					   &vm_flags);
-	referenced_folio = folio_test_clear_referenced(folio);
 
 	/*
 	 * The supposedly reclaimable folio was found to be in a VM_LOCKED vma.
@@ -888,6 +912,15 @@ static enum folio_references folio_check_references(struct folio *folio,
 	if (referenced_ptes == -1)
 		return FOLIOREF_KEEP;
 
+	if (lru_gen_enabled()) {
+		if (!referenced_ptes)
+			return FOLIOREF_RECLAIM;
+
+		return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
+	}
+
+	referenced_folio = folio_test_clear_referenced(folio);
+
 	if (referenced_ptes) {
 		/*
 		 * All mapped folios start out with page table
@@ -1092,11 +1125,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (!sc->may_unmap && folio_mapped(folio))
 			goto keep_locked;
 
-		/* folio_update_gen() tried to promote this page? */
-		if (lru_gen_enabled() && !ignore_references &&
-		    folio_mapped(folio) && folio_test_referenced(folio))
-			goto keep_locked;
-
 		/*
 		 * The number of dirty pages determines if a node is marked
 		 * reclaim_congested. kswapd will stall and start writing
@@ -3167,16 +3195,19 @@ static int folio_update_gen(struct folio *folio, int gen)
 
 	VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
 
+	/* see the comment on LRU_REFS_FLAGS */
+	if (!folio_test_referenced(folio) && !folio_test_workingset(folio)) {
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, BIT(PG_referenced));
+		return -1;
+	}
+
 	do {
 		/* lru_gen_del_folio() has isolated this page? */
-		if (!(old_flags & LRU_GEN_MASK)) {
-			/* for shrink_folio_list() */
-			new_flags = old_flags | BIT(PG_referenced);
-			continue;
-		}
+		if (!(old_flags & LRU_GEN_MASK))
+			return -1;
 
-		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
-		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
+		new_flags |= ((gen + 1UL) << LRU_GEN_PGOFF) | BIT(PG_workingset);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 
 	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
@@ -3200,7 +3231,7 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 
 		new_gen = (old_gen + 1) % MAX_NR_GENS;
 
-		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
+		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
 		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
 		/* for folio_end_writeback() */
 		if (reclaiming)
@@ -3378,9 +3409,11 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 				   struct pglist_data *pgdat)
 {
-	struct folio *folio;
+	struct folio *folio = pfn_folio(pfn);
+
+	if (folio_lru_gen(folio) < 0)
+		return NULL;
 
-	folio = pfn_folio(pfn);
 	if (folio_nid(folio) != pgdat->node_id)
 		return NULL;
 
@@ -3757,8 +3790,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 		while (!list_empty(head)) {
 			struct folio *folio = lru_to_folio(head);
 			int refs = folio_lru_refs(folio);
-			int tier = lru_tier_from_refs(refs);
-			int delta = folio_nr_pages(folio);
+			bool workingset = folio_test_workingset(folio);
 
 			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
@@ -3768,8 +3800,14 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 			new_gen = folio_inc_gen(lruvec, folio, false);
 			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
-			WRITE_ONCE(lrugen->protected[hist][type][tier],
-				   lrugen->protected[hist][type][tier] + delta);
+			/* don't count the workingset being lazily promoted */
+			if (refs + workingset != BIT(LRU_REFS_WIDTH) + 1) {
+				int tier = lru_tier_from_refs(refs, workingset);
+				int delta = folio_nr_pages(folio);
+
+				WRITE_ONCE(lrugen->protected[hist][type][tier],
+					   lrugen->protected[hist][type][tier] + delta);
+			}
 
 			if (!--remaining)
 				return false;
@@ -4155,16 +4193,10 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 			old_gen = folio_update_gen(folio, new_gen);
 			if (old_gen >= 0 && old_gen != new_gen)
 				update_batch_size(walk, folio, old_gen, new_gen);
-
-			continue;
-		}
-
-		old_gen = folio_lru_gen(folio);
-		if (old_gen < 0)
-			folio_set_referenced(folio);
-		else if (old_gen != new_gen) {
-			folio_clear_lru_refs(folio);
-			folio_activate(folio);
+		} else if (lru_gen_set_refs(folio)) {
+			old_gen = folio_lru_gen(folio);
+			if (old_gen >= 0 && old_gen != new_gen)
+				folio_activate(folio);
 		}
 	}
 
@@ -4325,7 +4357,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	int zone = folio_zonenum(folio);
 	int delta = folio_nr_pages(folio);
 	int refs = folio_lru_refs(folio);
-	int tier = lru_tier_from_refs(refs);
+	bool workingset = folio_test_workingset(folio);
+	int tier = lru_tier_from_refs(refs, workingset);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 
 	VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
@@ -4347,14 +4380,17 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* protected */
-	if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
-		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
-
+	if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
 		gen = folio_inc_gen(lruvec, folio, false);
-		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 
-		WRITE_ONCE(lrugen->protected[hist][type][tier],
-			   lrugen->protected[hist][type][tier] + delta);
+		/* don't count the workingset being lazily promoted */
+		if (refs + workingset != BIT(LRU_REFS_WIDTH) + 1) {
+			int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+			WRITE_ONCE(lrugen->protected[hist][type][tier],
+				   lrugen->protected[hist][type][tier] + delta);
+		}
 		return true;
 	}
 
@@ -4374,8 +4410,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* waiting for writeback */
-	if (folio_test_locked(folio) || writeback ||
-	    (type == LRU_GEN_FILE && dirty)) {
+	if (writeback || (type == LRU_GEN_FILE && dirty)) {
 		gen = folio_inc_gen(lruvec, folio, true);
 		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
@@ -4404,13 +4439,12 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 		return false;
 	}
 
-	/* see the comment on MAX_NR_TIERS */
+	/* see the comment on LRU_REFS_FLAGS */
 	if (!folio_test_referenced(folio))
-		folio_clear_lru_refs(folio);
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, 0);
 
 	/* for shrink_folio_list() */
 	folio_clear_reclaim(folio);
-	folio_clear_referenced(folio);
 
 	success = lru_gen_del_folio(lruvec, folio, true);
 	VM_WARN_ON_ONCE_FOLIO(!success, folio);
@@ -4600,31 +4634,24 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
+		DEFINE_MIN_SEQ(lruvec);
+
 		if (!folio_evictable(folio)) {
 			list_del(&folio->lru);
 			folio_putback_lru(folio);
 			continue;
 		}
 
-		if (folio_test_reclaim(folio) &&
-		    (folio_test_dirty(folio) || folio_test_writeback(folio))) {
-			/* restore LRU_REFS_FLAGS cleared by isolate_folio() */
-			if (folio_test_workingset(folio))
-				folio_set_referenced(folio);
-			continue;
-		}
-
-		if (skip_retry || folio_test_active(folio) || folio_test_referenced(folio) ||
-		    folio_mapped(folio) || folio_test_locked(folio) ||
-		    folio_test_dirty(folio) || folio_test_writeback(folio)) {
-			/* don't add rejected folios to the oldest generation */
-			set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS,
-				      BIT(PG_active));
-			continue;
-		}
-
 		/* retry folios that may have missed folio_rotate_reclaimable() */
-		list_move(&folio->lru, &clean);
+		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
+		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
+			list_move(&folio->lru, &clean);
+			continue;
+		}
+
+		/* don't add rejected folios to the oldest generation */
+		if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
+			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
 	}
 
 	spin_lock_irq(&lruvec->lru_lock);
diff --git a/mm/workingset.c b/mm/workingset.c
index 2c310c29f51e..4841ae8af411 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -239,7 +239,8 @@ static void *lru_gen_eviction(struct folio *folio)
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);
 	int refs = folio_lru_refs(folio);
-	int tier = lru_tier_from_refs(refs);
+	bool workingset = folio_test_workingset(folio);
+	int tier = lru_tier_from_refs(refs, workingset);
 	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
@@ -253,7 +254,7 @@ static void *lru_gen_eviction(struct folio *folio)
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
 
-	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
 }
 
 /*
@@ -304,24 +305,20 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	lrugen = &lruvec->lrugen;
 
 	hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
-	/* see the comment in folio_lru_refs() */
-	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
-	tier = lru_tier_from_refs(refs);
+	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + 1;
+	tier = lru_tier_from_refs(refs, workingset);
 
 	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
-	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
 
-	/*
-	 * Count the following two cases as stalls:
-	 * 1. For pages accessed through page tables, hotter pages pushed out
-	 *    hot pages which refaulted immediately.
-	 * 2. For pages accessed multiple times through file descriptors,
-	 *    they would have been protected by sort_folio().
-	 */
-	if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
-		set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
+	/* see folio_add_lru() where folio_set_active() will be called */
+	if (lru_gen_in_fault())
+		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+
+	if (workingset) {
+		folio_set_workingset(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
-	}
+	} else
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, (refs - 1UL) << LRU_REFS_PGOFF);
 unlock:
 	rcu_read_unlock();
 }
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH mm-unstable v4 7/7] mm/mglru: fix PTE-mapped large folios
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
                   ` (5 preceding siblings ...)
  2024-12-31  4:35 ` [PATCH mm-unstable v4 6/7] mm/mglru: rework workingset protection Yu Zhao
@ 2024-12-31  4:35 ` Yu Zhao
  2025-01-03  0:03 ` [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Andrew Morton
  7 siblings, 0 replies; 14+ messages in thread
From: Yu Zhao @ 2024-12-31  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Yu Zhao, Barry Song, Kalesh Singh

Count the accessed bits from PTEs mapping the same large folio as one
access rather than multiple accesses.

The previous patch changed how folios accessed through page tables are
promoted: rather than being promoted the first time the accessed bit is
cleared, a folio is now only promoted on subsequent accesses. Counting
the accessed bits from PTEs mapping the same large folio as multiple
accesses can cause that folio to be promoted prematurely, which in turn
can cause overprotection of single-use large folios.
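
As a standalone illustration (userspace C, not the kernel code; the
arrays below are made-up stand-ins for a PTE range and pfn_folio()),
a run of young PTEs mapping the same large folio now counts as a
single access:

  #include <stdio.h>

  #define NR_PTES 8

  int main(void)
  {
          /* two 4-PTE large folios mapped by 8 consecutive PTEs */
          int folio_of_pte[NR_PTES] = { 0, 0, 0, 0, 1, 1, 1, 1 };
          int young[NR_PTES]        = { 1, 1, 0, 1, 1, 1, 1, 1 };
          int per_pte = 0;        /* old behavior: one access per young PTE */
          int per_folio = 0;      /* new behavior: one access per folio run */
          int last = -1;

          for (int i = 0; i < NR_PTES; i++) {
                  if (!young[i])
                          continue;
                  per_pte++;
                  if (folio_of_pte[i] != last) {
                          per_folio++;
                          last = folio_of_pte[i];
                  }
          }
          /* prints "old: 7 accesses, new: 2 accesses" */
          printf("old: %d accesses, new: %d accesses\n", per_pte, per_folio);
          return 0;
  }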

This patch reduced the sys time of a kernel compilation by 2-5% (95%
CI) on an Altra M128-30 with 3GB DRAM, 12GB zram, 16KB THPs and -j32.

Reported-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
---
 mm/vmscan.c | 110 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 72 insertions(+), 38 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74bc85fc7cdf..a099876fa029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3431,29 +3431,55 @@ static bool suitable_to_scan(int total, int young)
 	return young * n >= total;
 }
 
+static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
+			      int new_gen, bool dirty)
+{
+	int old_gen;
+
+	if (!folio)
+		return;
+
+	if (dirty && !folio_test_dirty(folio) &&
+	    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+	      !folio_test_swapcache(folio)))
+		folio_mark_dirty(folio);
+
+	if (walk) {
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(walk, folio, old_gen, new_gen);
+	} else if (lru_gen_set_refs(folio)) {
+		old_gen = folio_lru_gen(folio);
+		if (old_gen >= 0 && old_gen != new_gen)
+			folio_activate(folio);
+	}
+}
+
 static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 			   struct mm_walk *args)
 {
 	int i;
+	bool dirty;
 	pte_t *pte;
 	spinlock_t *ptl;
 	unsigned long addr;
 	int total = 0;
 	int young = 0;
+	struct folio *last = NULL;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
-	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	int gen = lru_gen_from_seq(max_seq);
 	pmd_t pmdval;
 
-	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval,
-				       &ptl);
+	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
 	if (!pte)
 		return false;
+
 	if (!spin_trylock(ptl)) {
 		pte_unmap(pte);
-		return false;
+		return true;
 	}
 
 	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
@@ -3482,19 +3508,23 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (!ptep_clear_young_notify(args->vma, addr, pte + i))
 			continue;
 
+		if (last != folio) {
+			walk_update_folio(walk, last, gen, dirty);
+
+			last = folio;
+			dirty = false;
+		}
+
+		if (pte_dirty(ptent))
+			dirty = true;
+
 		young++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
-
-		if (pte_dirty(ptent) && !folio_test_dirty(folio) &&
-		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
-		      !folio_test_swapcache(folio)))
-			folio_mark_dirty(folio);
-
-		old_gen = folio_update_gen(folio, new_gen);
-		if (old_gen >= 0 && old_gen != new_gen)
-			update_batch_size(walk, folio, old_gen, new_gen);
 	}
 
+	walk_update_folio(walk, last, gen, dirty);
+	last = NULL;
+
 	if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
 		goto restart;
 
@@ -3508,13 +3538,15 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 				  struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
 {
 	int i;
+	bool dirty;
 	pmd_t *pmd;
 	spinlock_t *ptl;
+	struct folio *last = NULL;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
-	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	int gen = lru_gen_from_seq(max_seq);
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3567,20 +3599,23 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 		if (!pmdp_clear_young_notify(vma, addr, pmd + i))
 			goto next;
 
+		if (last != folio) {
+			walk_update_folio(walk, last, gen, dirty);
+
+			last = folio;
+			dirty = false;
+		}
+
+		if (pmd_dirty(pmd[i]))
+			dirty = true;
+
 		walk->mm_stats[MM_LEAF_YOUNG]++;
-
-		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
-		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
-		      !folio_test_swapcache(folio)))
-			folio_mark_dirty(folio);
-
-		old_gen = folio_update_gen(folio, new_gen);
-		if (old_gen >= 0 && old_gen != new_gen)
-			update_batch_size(walk, folio, old_gen, new_gen);
 next:
 		i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
 	} while (i <= MIN_LRU_BATCH);
 
+	walk_update_folio(walk, last, gen, dirty);
+
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(ptl);
 done:
@@ -4115,9 +4150,11 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
 	int i;
+	bool dirty;
 	unsigned long start;
 	unsigned long end;
 	struct lru_gen_mm_walk *walk;
+	struct folio *last = NULL;
 	int young = 1;
 	pte_t *pte = pvmw->pte;
 	unsigned long addr = pvmw->address;
@@ -4128,7 +4165,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
-	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	int gen = lru_gen_from_seq(max_seq);
 
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
@@ -4182,24 +4219,21 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (!ptep_clear_young_notify(vma, addr, pte + i))
 			continue;
 
-		young++;
+		if (last != folio) {
+			walk_update_folio(walk, last, gen, dirty);
 
-		if (pte_dirty(ptent) && !folio_test_dirty(folio) &&
-		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
-		      !folio_test_swapcache(folio)))
-			folio_mark_dirty(folio);
-
-		if (walk) {
-			old_gen = folio_update_gen(folio, new_gen);
-			if (old_gen >= 0 && old_gen != new_gen)
-				update_batch_size(walk, folio, old_gen, new_gen);
-		} else if (lru_gen_set_refs(folio)) {
-			old_gen = folio_lru_gen(folio);
-			if (old_gen >= 0 && old_gen != new_gen)
-				folio_activate(folio);
+			last = folio;
+			dirty = false;
 		}
+
+		if (pte_dirty(ptent))
+			dirty = true;
+
+		young++;
 	}
 
+	walk_update_folio(walk, last, gen, dirty);
+
 	arch_leave_lazy_mmu_mode();
 
 	/* feedback from rmap walkers to page table walkers */
-- 
2.47.1.613.gc27f4b7a9f-goog



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations
  2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
                   ` (6 preceding siblings ...)
  2024-12-31  4:35 ` [PATCH mm-unstable v4 7/7] mm/mglru: fix PTE-mapped large folios Yu Zhao
@ 2025-01-03  0:03 ` Andrew Morton
  2025-01-04  0:59   ` chenridong
  7 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2025-01-03  0:03 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, linux-kernel, Chen Ridong

On Mon, 30 Dec 2024 21:35:31 -0700 Yu Zhao <yuzhao@google.com> wrote:

> This series improves performance for some previously reported test
> cases. Most of the code changes gathered here has been floating on the
> mailing list [1][2]. They are now properly organized and have gone
> through various benchmarks on client and server devices, including
> Android, FIO, memcached, multiple VMs and MongoDB.

This series has significant conflicts with the patch "mm: vmscan: retry
folios written back while isolated for traditional LRU".  It appears
that "mm: vmscan: retry folios written back while isolated for
traditional LRU" is due for an update and that more discussion is
needed, so I shall drop this version of "mm: vmscan: retry folios
written back while isolated for traditional LRU" from mm-unstable.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations
  2025-01-03  0:03 ` [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Andrew Morton
@ 2025-01-04  0:59   ` chenridong
  2025-01-04  2:21     ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: chenridong @ 2025-01-04  0:59 UTC (permalink / raw)
  To: Andrew Morton, Yu Zhao; +Cc: linux-mm, linux-kernel



On 2025/1/3 8:03, Andrew Morton wrote:
> On Mon, 30 Dec 2024 21:35:31 -0700 Yu Zhao <yuzhao@google.com> wrote:
> 
>> This series improves performance for some previously reported test
>> cases. Most of the code changes gathered here has been floating on the
>> mailing list [1][2]. They are now properly organized and have gone
>> through various benchmarks on client and server devices, including
>> Android, FIO, memcached, multiple VMs and MongoDB.
> 
> This series has significant conflicts with the patch "mm: vmscan: retry
> folios written back while isolated for traditional LRU".  It appears
> that "mm: vmscan: retry folios written back while isolated for
> traditional LRU" is due for an update and that more discussion is
> needed, so I shall drop this version of "mm: vmscan: retry folios
> written back while isolated for traditional LRU" from mm-unstable.

Hi, Andrew and Yu, does this mean I should resend a new version that
fixes the conflict and updates the message?

Best regards,
Ridong




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations
  2025-01-04  0:59   ` chenridong
@ 2025-01-04  2:21     ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2025-01-04  2:21 UTC (permalink / raw)
  To: chenridong; +Cc: Yu Zhao, linux-mm, linux-kernel

On Sat, 4 Jan 2025 08:59:08 +0800 chenridong <chenridong@huawei.com> wrote:

> 
> 
> On 2025/1/3 8:03, Andrew Morton wrote:
> > On Mon, 30 Dec 2024 21:35:31 -0700 Yu Zhao <yuzhao@google.com> wrote:
> > 
> >> This series improves performance for some previously reported test
> >> cases. Most of the code changes gathered here has been floating on the
> >> mailing list [1][2]. They are now properly organized and have gone
> >> through various benchmarks on client and server devices, including
> >> Android, FIO, memcached, multiple VMs and MongoDB.
> > 
> > This series has significant conflicts with the patch "mm: vmscan: retry
> > folios written back while isolated for traditional LRU".  It appears
> > that "mm: vmscan: retry folios written back while isolated for
> > traditional LRU" is due for an update and that more discussion is
> > needed, so I shall drop this version of "mm: vmscan: retry folios
> > written back while isolated for traditional LRU" from mm-unstable.
> 
> Hi, Andrew and Yu, does this mean I should resend a new version that
> fixes the conflict and updates the message?

Yes please.  Against
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/'s
mm-unstable branch would be ideal.

Please also double-check that all review comments have been
addressed in an appropriate fashion.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback
  2024-12-31  4:35 ` [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback Yu Zhao
@ 2025-01-07 17:14   ` Kairui Song
  2025-01-13  6:51     ` Yu Zhao
  0 siblings, 1 reply; 14+ messages in thread
From: Kairui Song @ 2025-01-07 17:14 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, linux-mm, linux-kernel, David Stevens, Kalesh Singh

On Tue, Dec 31, 2024 at 12:36 PM Yu Zhao <yuzhao@google.com> wrote:

Hi Yu,

>
> The aging feedback is based on both the number of generations and the
> distribution of folios in each generation. The number of generations
> is currently the distance between max_seq and anon min_seq. This is
> because anon min_seq is not allowed to move past file min_seq. The
> rationale for that is that file is always evictable whereas anon is
> not. However, for use cases where anon is a lot cheaper than file:
> 1. Anon in the second oldest generation can be a better choice than
>    file in the oldest generation.
> 2. A large amount of file in the oldest generation can skew the
>    distribution, making should_run_aging() return false negative.
>
> Allow anon and file min_seq to move independently, and use solely the
> number of generations as the feedback for aging. Specifically, when
> both anon and file are evictable, anon min_seq can now be greater than
> file min_seq, and therefore the number of generations becomes the
> distance between max_seq and min(min_seq[0],min_seq[1]). And
> should_run_aging() returns true if and only if the number of
> generations is less than MAX_NR_GENS.
>
> As the first step to the final optimization, this change by itself
> should not have userspace-visible effects beyond performance. The
> next two patches will take advantage of this change; the last patch in
> this series will better distribute folios across MAX_NR_GENS.
>
> Reported-by: David Stevens <stevensd@chromium.org>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Kalesh Singh <kaleshsingh@google.com>
> ---
>  include/linux/mmzone.h |  17 ++--
>  mm/vmscan.c            | 200 ++++++++++++++++++-----------------------
>  2 files changed, 96 insertions(+), 121 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b36124145a16..8245ecb0400b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -421,12 +421,11 @@ enum {
>  /*
>   * The youngest generation number is stored in max_seq for both anon and file
>   * types as they are aged on an equal footing. The oldest generation numbers are
> - * stored in min_seq[] separately for anon and file types as clean file pages
> - * can be evicted regardless of swap constraints.
> - *
> - * Normally anon and file min_seq are in sync. But if swapping is constrained,
> - * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> - * min_seq behind.
> + * stored in min_seq[] separately for anon and file types so that they can be
> + * incremented independently. Ideally min_seq[] are kept in sync when both anon
> + * and file types are evictable. However, to adapt to situations like extreme
> + * swappiness, they are allowed to be out of sync by at most
> + * MAX_NR_GENS-MIN_NR_GENS-1.
>   *
>   * The number of pages in each generation is eventually consistent and therefore
>   * can be transiently negative when reset_batch_size() is pending.
> @@ -446,8 +445,8 @@ struct lru_gen_folio {
>         unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
>         /* the exponential moving average of evicted+protected */
>         unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
> -       /* the first tier doesn't need protection, hence the minus one */
> -       unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
> +       /* can only be modified under the LRU lock */
> +       unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         /* can be modified without holding the LRU lock */
>         atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> @@ -498,7 +497,7 @@ struct lru_gen_mm_walk {
>         int mm_stats[NR_MM_STATS];
>         /* total batched items */
>         int batched;
> -       bool can_swap;
> +       int swappiness;
>         bool force_scan;
>  };
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f236db86de8a..f767e3d34e73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2627,11 +2627,17 @@ static bool should_clear_pmd_young(void)
>                 READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),      \
>         }
>
> +#define evictable_min_seq(min_seq, swappiness)                         \
> +       min((min_seq)[!(swappiness)], (min_seq)[(swappiness) != MAX_SWAPPINESS])
> +
>  #define for_each_gen_type_zone(gen, type, zone)                                \
>         for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
>                 for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
>                         for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
>
> +#define for_each_evictable_type(type, swappiness)                      \
> +       for ((type) = !(swappiness); (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)
> +
>  #define get_memcg_gen(seq)     ((seq) % MEMCG_NR_GENS)
>  #define get_memcg_bin(bin)     ((bin) % MEMCG_NR_BINS)
>
> @@ -2677,10 +2683,16 @@ static int get_nr_gens(struct lruvec *lruvec, int type)
>
>  static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
>  {
> -       /* see the comment on lru_gen_folio */
> -       return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
> -              get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
> -              get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
> +       int type;
> +
> +       for (type = 0; type < ANON_AND_FILE; type++) {
> +               int n = get_nr_gens(lruvec, type);
> +
> +               if (n < MIN_NR_GENS || n > MAX_NR_GENS)
> +                       return false;
> +       }
> +
> +       return true;
>  }
>
>  /******************************************************************************
> @@ -3087,9 +3099,8 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
>         pos->refaulted = lrugen->avg_refaulted[type][tier] +
>                          atomic_long_read(&lrugen->refaulted[hist][type][tier]);
>         pos->total = lrugen->avg_total[type][tier] +
> +                    lrugen->protected[hist][type][tier] +
>                      atomic_long_read(&lrugen->evicted[hist][type][tier]);
> -       if (tier)
> -               pos->total += lrugen->protected[hist][type][tier - 1];
>         pos->gain = gain;
>  }
>
> @@ -3116,17 +3127,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
>                         WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
>
>                         sum = lrugen->avg_total[type][tier] +
> +                             lrugen->protected[hist][type][tier] +
>                               atomic_long_read(&lrugen->evicted[hist][type][tier]);
> -                       if (tier)
> -                               sum += lrugen->protected[hist][type][tier - 1];
>                         WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
>                 }
>
>                 if (clear) {
>                         atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
>                         atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
> -                       if (tier)
> -                               WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
> +                       WRITE_ONCE(lrugen->protected[hist][type][tier], 0);
>                 }
>         }
>  }
> @@ -3261,7 +3270,7 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
>                 return true;
>
>         if (vma_is_anonymous(vma))
> -               return !walk->can_swap;
> +               return !walk->swappiness;
>
>         if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
>                 return true;
> @@ -3271,7 +3280,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
>                 return true;
>
>         if (shmem_mapping(mapping))
> -               return !walk->can_swap;
> +               return !walk->swappiness;
> +
> +       if (walk->swappiness == MAX_SWAPPINESS)
> +               return true;
>
>         /* to exclude special mappings like dax, etc. */
>         return !mapping->a_ops->read_folio;
> @@ -3359,7 +3371,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
>  }
>
>  static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
> -                                  struct pglist_data *pgdat, bool can_swap)
> +                                  struct pglist_data *pgdat)
>  {
>         struct folio *folio;
>
> @@ -3370,10 +3382,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
>         if (folio_memcg(folio) != memcg)
>                 return NULL;
>
> -       /* file VMAs can contain anon pages from COW */
> -       if (!folio_is_file_lru(folio) && !can_swap)
> -               return NULL;
> -
>         return folio;
>  }
>
> @@ -3429,7 +3437,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 if (pfn == -1)
>                         continue;
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> +               folio = get_pfn_folio(pfn, memcg, pgdat);
>                 if (!folio)
>                         continue;
>
> @@ -3514,7 +3522,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
>                 if (pfn == -1)
>                         goto next;
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> +               folio = get_pfn_folio(pfn, memcg, pgdat);
>                 if (!folio)
>                         goto next;
>
> @@ -3726,22 +3734,26 @@ static void clear_mm_walk(void)
>                 kfree(walk);
>  }
>
> -static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> +static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
>  {
>         int zone;
>         int remaining = MAX_LRU_BATCH;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +       int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>         int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
> -       if (type == LRU_GEN_ANON && !can_swap)
> +       if (type ? swappiness == MAX_SWAPPINESS : !swappiness)
>                 goto done;
>
> -       /* prevent cold/hot inversion if force_scan is true */
> +       /* prevent cold/hot inversion if the type is evictable */
>         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
>
>                 while (!list_empty(head)) {
>                         struct folio *folio = lru_to_folio(head);
> +                       int refs = folio_lru_refs(folio);
> +                       int tier = lru_tier_from_refs(refs);
> +                       int delta = folio_nr_pages(folio);
>
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> @@ -3751,6 +3763,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>                         new_gen = folio_inc_gen(lruvec, folio, false);
>                         list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>
> +                       WRITE_ONCE(lrugen->protected[hist][type][tier],
> +                                  lrugen->protected[hist][type][tier] + delta);
> +
>                         if (!--remaining)
>                                 return false;
>                 }
> @@ -3762,7 +3777,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>         return true;
>  }
>
> -static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> +static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
>  {
>         int gen, type, zone;
>         bool success = false;
> @@ -3772,7 +3787,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>
>         /* find the oldest populated generation */
> -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +       for_each_evictable_type(type, swappiness) {
>                 while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
>                         gen = lru_gen_from_seq(min_seq[type]);
>
> @@ -3788,13 +3803,17 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>         }
>
>         /* see the comment on lru_gen_folio */
> -       if (can_swap) {
> -               min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
> -               min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
> +       if (swappiness && swappiness != MAX_SWAPPINESS) {
> +               unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
> +
> +               if (min_seq[LRU_GEN_ANON] > seq && min_seq[LRU_GEN_FILE] < seq)
> +                       min_seq[LRU_GEN_ANON] = seq;
> +               else if (min_seq[LRU_GEN_FILE] > seq && min_seq[LRU_GEN_ANON] < seq)
> +                       min_seq[LRU_GEN_FILE] = seq;
>         }
>
> -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> -               if (min_seq[type] == lrugen->min_seq[type])
> +       for_each_evictable_type(type, swappiness) {
> +               if (min_seq[type] <= lrugen->min_seq[type])
>                         continue;
>
>                 reset_ctrl_pos(lruvec, type, true);
> @@ -3805,8 +3824,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>         return success;
>  }
>
> -static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> -                       bool can_swap, bool force_scan)
> +static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
>  {
>         bool success;
>         int prev, next;
> @@ -3824,13 +3842,11 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>         if (!success)
>                 goto unlock;
>
> -       for (type = ANON_AND_FILE - 1; type >= 0; type--) {
> +       for (type = 0; type < ANON_AND_FILE; type++) {
>                 if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
>                         continue;
>
> -               VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
> -
> -               if (inc_min_seq(lruvec, type, can_swap))
> +               if (inc_min_seq(lruvec, type, swappiness))
>                         continue;
>
>                 spin_unlock_irq(&lruvec->lru_lock);
> @@ -3874,7 +3890,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>  }
>
>  static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> -                              bool can_swap, bool force_scan)
> +                              int swappiness, bool force_scan)
>  {
>         bool success;
>         struct lru_gen_mm_walk *walk;
> @@ -3885,7 +3901,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>         VM_WARN_ON_ONCE(seq > READ_ONCE(lrugen->max_seq));
>
>         if (!mm_state)
> -               return inc_max_seq(lruvec, seq, can_swap, force_scan);
> +               return inc_max_seq(lruvec, seq, swappiness);
>
>         /* see the comment in iterate_mm_list() */
>         if (seq <= READ_ONCE(mm_state->seq))
> @@ -3910,7 +3926,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>
>         walk->lruvec = lruvec;
>         walk->seq = seq;
> -       walk->can_swap = can_swap;
> +       walk->swappiness = swappiness;
>         walk->force_scan = force_scan;
>
>         do {
> @@ -3920,7 +3936,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>         } while (mm);
>  done:
>         if (success) {
> -               success = inc_max_seq(lruvec, seq, can_swap, force_scan);
> +               success = inc_max_seq(lruvec, seq, swappiness);
>                 WARN_ON_ONCE(!success);
>         }
>
> @@ -3961,13 +3977,13 @@ static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
>  {
>         int gen, type, zone;
>         unsigned long total = 0;
> -       bool can_swap = get_swappiness(lruvec, sc);
> +       int swappiness = get_swappiness(lruvec, sc);
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +       for_each_evictable_type(type, swappiness) {
>                 unsigned long seq;
>
>                 for (seq = min_seq[type]; seq <= max_seq; seq++) {
> @@ -3987,6 +4003,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
>  {
>         int gen;
>         unsigned long birth;
> +       int swappiness = get_swappiness(lruvec, sc);
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> @@ -3996,8 +4013,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
>         if (!lruvec_is_sizable(lruvec, sc))
>                 return false;
>
> -       /* see the comment on lru_gen_folio */
> -       gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
> +       gen = lru_gen_from_seq(evictable_min_seq(min_seq, swappiness));
>         birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
>
>         return time_is_before_jiffies(birth + min_ttl);
> @@ -4064,7 +4080,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>         unsigned long addr = pvmw->address;
>         struct vm_area_struct *vma = pvmw->vma;
>         struct folio *folio = pfn_folio(pvmw->pfn);
> -       bool can_swap = !folio_is_file_lru(folio);
>         struct mem_cgroup *memcg = folio_memcg(folio);
>         struct pglist_data *pgdat = folio_pgdat(folio);
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> @@ -4117,7 +4132,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>                 if (pfn == -1)
>                         continue;
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
> +               folio = get_pfn_folio(pfn, memcg, pgdat);
>                 if (!folio)
>                         continue;
>
> @@ -4333,8 +4348,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                 gen = folio_inc_gen(lruvec, folio, false);
>                 list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
>
> -               WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
> -                          lrugen->protected[hist][type][tier - 1] + delta);
> +               WRITE_ONCE(lrugen->protected[hist][type][tier],
> +                          lrugen->protected[hist][type][tier] + delta);
>                 return true;
>         }
>
> @@ -4533,7 +4548,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>  {
>         int i;
>         int type;
> -       int scanned;
>         int tier = -1;
>         DEFINE_MIN_SEQ(lruvec);
>
> @@ -4558,21 +4572,23 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>         else
>                 type = get_type_to_scan(lruvec, swappiness, &tier);
>
> -       for (i = !swappiness; i < ANON_AND_FILE; i++) {
> +       for_each_evictable_type(i, swappiness) {

Thanks for working on solving the reported issues, but I have one
concern about this for_each_evictable_type macro and its usage here.

It basically forbids eviction of file pages with "swappiness == 200",
even under global pressure, which is quite a change.

For both the active/inactive LRU and MGLRU, max swappiness used to make
the kernel try to reclaim anon as much as possible while still falling
back to file eviction. Forbidding file eviction may cause unsolvable
OOMs: unlike anon pages, killing a process won't necessarily release
file pages, so the system could hang easily.

Existing systems with swappiness == 200 that were running fine before
may also hit OOM very quickly.
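
To make the concern concrete, here is a standalone sketch (userspace C,
not kernel code) of what the quoted for_each_evictable_type() expands
to at the swappiness extremes; MAX_SWAPPINESS and the type enum mirror
the kernel definitions, and the swappiness values are just examples:

  #include <stdio.h>

  #define MAX_SWAPPINESS  200
  enum { LRU_GEN_ANON, LRU_GEN_FILE };

  #define for_each_evictable_type(type, swappiness)                     \
          for ((type) = !(swappiness);                                  \
               (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)

  int main(void)
  {
          int swappiness_values[] = { 0, 60, 200 };

          for (int i = 0; i < 3; i++) {
                  int type, swappiness = swappiness_values[i];

                  printf("swappiness=%3d:", swappiness);
                  for_each_evictable_type(type, swappiness)
                          printf(" %s", type == LRU_GEN_FILE ? "file" : "anon");
                  printf("\n");
          }
          /*
           * swappiness=  0: file
           * swappiness= 60: anon file
           * swappiness=200: anon
           */
          return 0;
  }

At swappiness == 200, file is never visited, so the fallback to file
eviction that used to exist is gone.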


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback
  2025-01-07 17:14   ` Kairui Song
@ 2025-01-13  6:51     ` Yu Zhao
  2025-01-15 17:56       ` Kairui Song
  0 siblings, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2025-01-13  6:51 UTC (permalink / raw)
  To: Kairui Song
  Cc: Andrew Morton, linux-mm, linux-kernel, David Stevens, Kalesh Singh

On Wed, Jan 08, 2025 at 01:14:58AM +0800, Kairui Song wrote:
> On Tue, Dec 31, 2024 at 12:36 PM Yu Zhao <yuzhao@google.com> wrote:
> 
> Hi Yu,
> 
> >
> > The aging feedback is based on both the number of generations and the
> > distribution of folios in each generation. The number of generations
> > is currently the distance between max_seq and anon min_seq. This is
> > because anon min_seq is not allowed to move past file min_seq. The
> > rationale for that is that file is always evictable whereas anon is
> > not. However, for use cases where anon is a lot cheaper than file:
> > 1. Anon in the second oldest generation can be a better choice than
> >    file in the oldest generation.
> > 2. A large amount of file in the oldest generation can skew the
> >    distribution, making should_run_aging() return false negative.
> >
> > Allow anon and file min_seq to move independently, and use solely the
> > number of generations as the feedback for aging. Specifically, when
> > both anon and file are evictable, anon min_seq can now be greater than
> > file min_seq, and therefore the number of generations becomes the
> > distance between max_seq and min(min_seq[0],min_seq[1]). And
> > should_run_aging() returns true if and only if the number of
> > generations is less than MAX_NR_GENS.
> >
> > As the first step to the final optimization, this change by itself
> > should not have userspace-visible effects beyond performance. The
> > next two patches will take advantage of this change; the last patch in
> > this series will better distribute folios across MAX_NR_GENS.
> >
> > Reported-by: David Stevens <stevensd@chromium.org>
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > Tested-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> >  include/linux/mmzone.h |  17 ++--
> >  mm/vmscan.c            | 200 ++++++++++++++++++-----------------------
> >  2 files changed, 96 insertions(+), 121 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index b36124145a16..8245ecb0400b 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -421,12 +421,11 @@ enum {
> >  /*
> >   * The youngest generation number is stored in max_seq for both anon and file
> >   * types as they are aged on an equal footing. The oldest generation numbers are
> > - * stored in min_seq[] separately for anon and file types as clean file pages
> > - * can be evicted regardless of swap constraints.
> > - *
> > - * Normally anon and file min_seq are in sync. But if swapping is constrained,
> > - * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> > - * min_seq behind.
> > + * stored in min_seq[] separately for anon and file types so that they can be
> > + * incremented independently. Ideally min_seq[] are kept in sync when both anon
> > + * and file types are evictable. However, to adapt to situations like extreme
> > + * swappiness, they are allowed to be out of sync by at most
> > + * MAX_NR_GENS-MIN_NR_GENS-1.
> >   *
> >   * The number of pages in each generation is eventually consistent and therefore
> >   * can be transiently negative when reset_batch_size() is pending.
> > @@ -446,8 +445,8 @@ struct lru_gen_folio {
> >         unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
> >         /* the exponential moving average of evicted+protected */
> >         unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
> > -       /* the first tier doesn't need protection, hence the minus one */
> > -       unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
> > +       /* can only be modified under the LRU lock */
> > +       unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> >         /* can be modified without holding the LRU lock */
> >         atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > @@ -498,7 +497,7 @@ struct lru_gen_mm_walk {
> >         int mm_stats[NR_MM_STATS];
> >         /* total batched items */
> >         int batched;
> > -       bool can_swap;
> > +       int swappiness;
> >         bool force_scan;
> >  };
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f236db86de8a..f767e3d34e73 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2627,11 +2627,17 @@ static bool should_clear_pmd_young(void)
> >                 READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),      \
> >         }
> >
> > +#define evictable_min_seq(min_seq, swappiness)                         \
> > +       min((min_seq)[!(swappiness)], (min_seq)[(swappiness) != MAX_SWAPPINESS])
> > +
> >  #define for_each_gen_type_zone(gen, type, zone)                                \
> >         for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
> >                 for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
> >                         for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
> >
> > +#define for_each_evictable_type(type, swappiness)                      \
> > +       for ((type) = !(swappiness); (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)
> > +
> >  #define get_memcg_gen(seq)     ((seq) % MEMCG_NR_GENS)
> >  #define get_memcg_bin(bin)     ((bin) % MEMCG_NR_BINS)
> >
> > @@ -2677,10 +2683,16 @@ static int get_nr_gens(struct lruvec *lruvec, int type)
> >
> >  static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
> >  {
> > -       /* see the comment on lru_gen_folio */
> > -       return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
> > -              get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
> > -              get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
> > +       int type;
> > +
> > +       for (type = 0; type < ANON_AND_FILE; type++) {
> > +               int n = get_nr_gens(lruvec, type);
> > +
> > +               if (n < MIN_NR_GENS || n > MAX_NR_GENS)
> > +                       return false;
> > +       }
> > +
> > +       return true;
> >  }
> >
> >  /******************************************************************************
> > @@ -3087,9 +3099,8 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
> >         pos->refaulted = lrugen->avg_refaulted[type][tier] +
> >                          atomic_long_read(&lrugen->refaulted[hist][type][tier]);
> >         pos->total = lrugen->avg_total[type][tier] +
> > +                    lrugen->protected[hist][type][tier] +
> >                      atomic_long_read(&lrugen->evicted[hist][type][tier]);
> > -       if (tier)
> > -               pos->total += lrugen->protected[hist][type][tier - 1];
> >         pos->gain = gain;
> >  }
> >
> > @@ -3116,17 +3127,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
> >                         WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
> >
> >                         sum = lrugen->avg_total[type][tier] +
> > +                             lrugen->protected[hist][type][tier] +
> >                               atomic_long_read(&lrugen->evicted[hist][type][tier]);
> > -                       if (tier)
> > -                               sum += lrugen->protected[hist][type][tier - 1];
> >                         WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
> >                 }
> >
> >                 if (clear) {
> >                         atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
> >                         atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
> > -                       if (tier)
> > -                               WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
> > +                       WRITE_ONCE(lrugen->protected[hist][type][tier], 0);
> >                 }
> >         }
> >  }
> > @@ -3261,7 +3270,7 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
> >                 return true;
> >
> >         if (vma_is_anonymous(vma))
> > -               return !walk->can_swap;
> > +               return !walk->swappiness;
> >
> >         if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
> >                 return true;
> > @@ -3271,7 +3280,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
> >                 return true;
> >
> >         if (shmem_mapping(mapping))
> > -               return !walk->can_swap;
> > +               return !walk->swappiness;
> > +
> > +       if (walk->swappiness == MAX_SWAPPINESS)
> > +               return true;
> >
> >         /* to exclude special mappings like dax, etc. */
> >         return !mapping->a_ops->read_folio;
> > @@ -3359,7 +3371,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
> >  }
> >
> >  static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
> > -                                  struct pglist_data *pgdat, bool can_swap)
> > +                                  struct pglist_data *pgdat)
> >  {
> >         struct folio *folio;
> >
> > @@ -3370,10 +3382,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
> >         if (folio_memcg(folio) != memcg)
> >                 return NULL;
> >
> > -       /* file VMAs can contain anon pages from COW */
> > -       if (!folio_is_file_lru(folio) && !can_swap)
> > -               return NULL;
> > -
> >         return folio;
> >  }
> >
> > @@ -3429,7 +3437,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >                 if (pfn == -1)
> >                         continue;
> >
> > -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> > +               folio = get_pfn_folio(pfn, memcg, pgdat);
> >                 if (!folio)
> >                         continue;
> >
> > @@ -3514,7 +3522,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
> >                 if (pfn == -1)
> >                         goto next;
> >
> > -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> > +               folio = get_pfn_folio(pfn, memcg, pgdat);
> >                 if (!folio)
> >                         goto next;
> >
> > @@ -3726,22 +3734,26 @@ static void clear_mm_walk(void)
> >                 kfree(walk);
> >  }
> >
> > -static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > +static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
> >  {
> >         int zone;
> >         int remaining = MAX_LRU_BATCH;
> >         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > +       int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> >         int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> >
> > -       if (type == LRU_GEN_ANON && !can_swap)
> > +       if (type ? swappiness == MAX_SWAPPINESS : !swappiness)
> >                 goto done;
> >
> > -       /* prevent cold/hot inversion if force_scan is true */
> > +       /* prevent cold/hot inversion if the type is evictable */
> >         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> >                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
> >
> >                 while (!list_empty(head)) {
> >                         struct folio *folio = lru_to_folio(head);
> > +                       int refs = folio_lru_refs(folio);
> > +                       int tier = lru_tier_from_refs(refs);
> > +                       int delta = folio_nr_pages(folio);
> >
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> > @@ -3751,6 +3763,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> >                         new_gen = folio_inc_gen(lruvec, folio, false);
> >                         list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> >
> > +                       WRITE_ONCE(lrugen->protected[hist][type][tier],
> > +                                  lrugen->protected[hist][type][tier] + delta);
> > +
> >                         if (!--remaining)
> >                                 return false;
> >                 }
> > @@ -3762,7 +3777,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> >         return true;
> >  }
> >
> > -static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> > +static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
> >  {
> >         int gen, type, zone;
> >         bool success = false;
> > @@ -3772,7 +3787,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> >         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> >
> >         /* find the oldest populated generation */
> > -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > +       for_each_evictable_type(type, swappiness) {
> >                 while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
> >                         gen = lru_gen_from_seq(min_seq[type]);
> >
> > @@ -3788,13 +3803,17 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> >         }
> >
> >         /* see the comment on lru_gen_folio */
> > -       if (can_swap) {
> > -               min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
> > -               min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
> > +       if (swappiness && swappiness != MAX_SWAPPINESS) {
> > +               unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
> > +
> > +               if (min_seq[LRU_GEN_ANON] > seq && min_seq[LRU_GEN_FILE] < seq)
> > +                       min_seq[LRU_GEN_ANON] = seq;
> > +               else if (min_seq[LRU_GEN_FILE] > seq && min_seq[LRU_GEN_ANON] < seq)
> > +                       min_seq[LRU_GEN_FILE] = seq;
> >         }
> >
> > -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > -               if (min_seq[type] == lrugen->min_seq[type])
> > +       for_each_evictable_type(type, swappiness) {
> > +               if (min_seq[type] <= lrugen->min_seq[type])
> >                         continue;
> >
> >                 reset_ctrl_pos(lruvec, type, true);
> > @@ -3805,8 +3824,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> >         return success;
> >  }
> >
> > -static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > -                       bool can_swap, bool force_scan)
> > +static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
> >  {
> >         bool success;
> >         int prev, next;
> > @@ -3824,13 +3842,11 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> >         if (!success)
> >                 goto unlock;
> >
> > -       for (type = ANON_AND_FILE - 1; type >= 0; type--) {
> > +       for (type = 0; type < ANON_AND_FILE; type++) {
> >                 if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
> >                         continue;
> >
> > -               VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
> > -
> > -               if (inc_min_seq(lruvec, type, can_swap))
> > +               if (inc_min_seq(lruvec, type, swappiness))
> >                         continue;
> >
> >                 spin_unlock_irq(&lruvec->lru_lock);
> > @@ -3874,7 +3890,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> >  }
> >
> >  static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > -                              bool can_swap, bool force_scan)
> > +                              int swappiness, bool force_scan)
> >  {
> >         bool success;
> >         struct lru_gen_mm_walk *walk;
> > @@ -3885,7 +3901,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> >         VM_WARN_ON_ONCE(seq > READ_ONCE(lrugen->max_seq));
> >
> >         if (!mm_state)
> > -               return inc_max_seq(lruvec, seq, can_swap, force_scan);
> > +               return inc_max_seq(lruvec, seq, swappiness);
> >
> >         /* see the comment in iterate_mm_list() */
> >         if (seq <= READ_ONCE(mm_state->seq))
> > @@ -3910,7 +3926,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> >
> >         walk->lruvec = lruvec;
> >         walk->seq = seq;
> > -       walk->can_swap = can_swap;
> > +       walk->swappiness = swappiness;
> >         walk->force_scan = force_scan;
> >
> >         do {
> > @@ -3920,7 +3936,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> >         } while (mm);
> >  done:
> >         if (success) {
> > -               success = inc_max_seq(lruvec, seq, can_swap, force_scan);
> > +               success = inc_max_seq(lruvec, seq, swappiness);
> >                 WARN_ON_ONCE(!success);
> >         }
> >
> > @@ -3961,13 +3977,13 @@ static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> >  {
> >         int gen, type, zone;
> >         unsigned long total = 0;
> > -       bool can_swap = get_swappiness(lruvec, sc);
> > +       int swappiness = get_swappiness(lruvec, sc);
> >         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >         DEFINE_MAX_SEQ(lruvec);
> >         DEFINE_MIN_SEQ(lruvec);
> >
> > -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > +       for_each_evictable_type(type, swappiness) {
> >                 unsigned long seq;
> >
> >                 for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > @@ -3987,6 +4003,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
> >  {
> >         int gen;
> >         unsigned long birth;
> > +       int swappiness = get_swappiness(lruvec, sc);
> >         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >         DEFINE_MIN_SEQ(lruvec);
> >
> > @@ -3996,8 +4013,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
> >         if (!lruvec_is_sizable(lruvec, sc))
> >                 return false;
> >
> > -       /* see the comment on lru_gen_folio */
> > -       gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
> > +       gen = lru_gen_from_seq(evictable_min_seq(min_seq, swappiness));
> >         birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
> >
> >         return time_is_before_jiffies(birth + min_ttl);
> > @@ -4064,7 +4080,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> >         unsigned long addr = pvmw->address;
> >         struct vm_area_struct *vma = pvmw->vma;
> >         struct folio *folio = pfn_folio(pvmw->pfn);
> > -       bool can_swap = !folio_is_file_lru(folio);
> >         struct mem_cgroup *memcg = folio_memcg(folio);
> >         struct pglist_data *pgdat = folio_pgdat(folio);
> >         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > @@ -4117,7 +4132,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> >                 if (pfn == -1)
> >                         continue;
> >
> > -               folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
> > +               folio = get_pfn_folio(pfn, memcg, pgdat);
> >                 if (!folio)
> >                         continue;
> >
> > @@ -4333,8 +4348,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                 gen = folio_inc_gen(lruvec, folio, false);
> >                 list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
> >
> > -               WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
> > -                          lrugen->protected[hist][type][tier - 1] + delta);
> > +               WRITE_ONCE(lrugen->protected[hist][type][tier],
> > +                          lrugen->protected[hist][type][tier] + delta);
> >                 return true;
> >         }
> >
> > @@ -4533,7 +4548,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> >  {
> >         int i;
> >         int type;
> > -       int scanned;
> >         int tier = -1;
> >         DEFINE_MIN_SEQ(lruvec);
> >
> > @@ -4558,21 +4572,23 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> >         else
> >                 type = get_type_to_scan(lruvec, swappiness, &tier);
> >
> > -       for (i = !swappiness; i < ANON_AND_FILE; i++) {
> > +       for_each_evictable_type(i, swappiness) {
> 
> Thanks for working on solving the reported issues, but I have one
> concern about this for_each_evictable_type macro and its usage here.
>
> It basically forbids eviction of file pages with "swappiness == 200",
> even for global pressure, which is quite a change.
>
> For both the active/inactive LRU and MGLRU, max swappiness used to make
> the kernel try to reclaim anon as much as possible while still falling
> back to file eviction. Forbidding file eviction may cause an unsolvable
> OOM: unlike anon pages, killing a process won't necessarily release
> file pages, so the system could hang easily.
>
> Existing systems with swappiness == 200 that were running fine before
> may also hit OOM very quickly.

Do you know of anyone who actually uses 200? I only use 200 for testing,
but I can use 201 instead, since the debugfs interface isn't limited to
[0, 200].

I think the following addresses your concern. If so, Andrew, could you
please squash it into this patch? Thanks!


diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7eaa975d8546..fe41c8d6800a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2629,7 +2629,7 @@ static bool should_clear_pmd_young(void)
 	}
 
 #define evictable_min_seq(min_seq, swappiness)				\
-	min((min_seq)[!(swappiness)], (min_seq)[(swappiness) != MAX_SWAPPINESS])
+	min((min_seq)[!(swappiness)], (min_seq)[(swappiness) <= MAX_SWAPPINESS])
 
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
@@ -2637,7 +2637,7 @@ static bool should_clear_pmd_young(void)
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
 #define for_each_evictable_type(type, swappiness)			\
-	for ((type) = !(swappiness); (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)
+	for ((type) = !(swappiness); (type) <= ((swappiness) <= MAX_SWAPPINESS); (type)++)
 
 #define get_memcg_gen(seq)	((seq) % MEMCG_NR_GENS)
 #define get_memcg_bin(bin)	((bin) % MEMCG_NR_BINS)
@@ -3288,7 +3288,7 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
 	if (shmem_mapping(mapping))
 		return !walk->swappiness;
 
-	if (walk->swappiness == MAX_SWAPPINESS)
+	if (walk->swappiness > MAX_SWAPPINESS)
 		return true;
 
 	/* to exclude special mappings like dax, etc. */
@@ -3748,7 +3748,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
 
-	if (type ? swappiness == MAX_SWAPPINESS : !swappiness)
+	if (type ? swappiness > MAX_SWAPPINESS : !swappiness)
 		goto done;
 
 	/* prevent cold/hot inversion if the type is evictable */
@@ -3809,7 +3809,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 	}
 
 	/* see the comment on lru_gen_folio */
-	if (swappiness && swappiness != MAX_SWAPPINESS) {
+	if (swappiness && swappiness <= MAX_SWAPPINESS) {
 		unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
 
 		if (min_seq[LRU_GEN_ANON] > seq && min_seq[LRU_GEN_FILE] < seq)
@@ -4525,10 +4525,10 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
 {
 	struct ctrl_pos sp, pv;
 
-	if (!swappiness)
+	if (swappiness <= MIN_SWAPPINESS + 1)
 		return LRU_GEN_FILE;
 
-	if (swappiness == MAX_SWAPPINESS)
+	if (swappiness >= MAX_SWAPPINESS)
 		return LRU_GEN_ANON;
 	/*
 	 * Compare the sum of all tiers of anon with that of file to determine
@@ -5423,7 +5423,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
 
 	if (swappiness < MIN_SWAPPINESS)
 		swappiness = get_swappiness(lruvec, sc);
-	else if (swappiness > MAX_SWAPPINESS)
+	else if (swappiness > MAX_SWAPPINESS + 1)
 		goto done;
 
 	switch (cmd) {
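
To make the resulting semantics concrete: with the squash above, swappiness
0 still scans file only, anything in [1, MAX_SWAPPINESS] scans both types
(so the usual fallback to file eviction is preserved), and only the
out-of-range value MAX_SWAPPINESS + 1 pins reclaim to anon. Below is a
minimal user-space sketch (not kernel code; the constants and the macro
merely mirror the kernel's definitions, assuming MAX_SWAPPINESS == 200,
LRU_GEN_ANON == 0 and LRU_GEN_FILE == 1) of how for_each_evictable_type()
picks types:

/* user-space illustration of the updated for_each_evictable_type() */
#include <stdio.h>

#define MAX_SWAPPINESS	200
#define LRU_GEN_ANON	0
#define LRU_GEN_FILE	1

#define for_each_evictable_type(type, swappiness)			\
	for ((type) = !(swappiness); (type) <= ((swappiness) <= MAX_SWAPPINESS); (type)++)

int main(void)
{
	int swappiness_values[] = { 0, 1, 100, 200, 201 };
	unsigned long i;

	for (i = 0; i < sizeof(swappiness_values) / sizeof(int); i++) {
		int type;
		int swappiness = swappiness_values[i];

		printf("swappiness=%3d evictable:", swappiness);
		for_each_evictable_type(type, swappiness)
			printf(" %s", type == LRU_GEN_ANON ? "anon" : "file");
		printf("\n");
	}

	return 0;
}

This prints "file" only for 0, "anon file" for 1, 100 and 200, and "anon"
only for 201, i.e. the regular [0, 200] range keeps its fallback-to-file
behavior and only 201 forbids file eviction.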


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback
  2025-01-13  6:51     ` Yu Zhao
@ 2025-01-15 17:56       ` Kairui Song
  0 siblings, 0 replies; 14+ messages in thread
From: Kairui Song @ 2025-01-15 17:56 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, linux-mm, linux-kernel, David Stevens, Kalesh Singh

On Mon, Jan 13, 2025 at 2:51 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jan 08, 2025 at 01:14:58AM +0800, Kairui Song wrote:
> > On Tue, Dec 31, 2024 at 12:36 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Hi Yu,
> >
> > >
> > > The aging feedback is based on both the number of generations and the
> > > distribution of folios in each generation. The number of generations
> > > is currently the distance between max_seq and anon min_seq. This is
> > > because anon min_seq is not allowed to move past file min_seq. The
> > > rationale for that is that file is always evictable whereas anon is
> > > not. However, for use cases where anon is a lot cheaper than file:
> > > 1. Anon in the second oldest generation can be a better choice than
> > >    file in the oldest generation.
> > > 2. A large amount of file in the oldest generation can skew the
> > >    distribution, making should_run_aging() return false negative.
> > >
> > > Allow anon and file min_seq to move independently, and use solely the
> > > number of generations as the feedback for aging. Specifically, when
> > > both anon and file are evictable, anon min_seq can now be greater than
> > > file min_seq, and therefore the number of generations becomes the
> > > distance between max_seq and min(min_seq[0],min_seq[1]). And
> > > should_run_aging() returns true if and only if the number of
> > > generations is less than MAX_NR_GENS.
> > >
> > > As the first step to the final optimization, this change by itself
> > > should not have userspace-visible effects beyond performance. The
> > > next two patches will take advantage of this change; the last patch in
> > > this series will better distribute folios across MAX_NR_GENS.
> > >
> > > Reported-by: David Stevens <stevensd@chromium.org>
> > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > Tested-by: Kalesh Singh <kaleshsingh@google.com>
> > > ---
> > >  include/linux/mmzone.h |  17 ++--
> > >  mm/vmscan.c            | 200 ++++++++++++++++++-----------------------
> > >  2 files changed, 96 insertions(+), 121 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index b36124145a16..8245ecb0400b 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -421,12 +421,11 @@ enum {
> > >  /*
> > >   * The youngest generation number is stored in max_seq for both anon and file
> > >   * types as they are aged on an equal footing. The oldest generation numbers are
> > > - * stored in min_seq[] separately for anon and file types as clean file pages
> > > - * can be evicted regardless of swap constraints.
> > > - *
> > > - * Normally anon and file min_seq are in sync. But if swapping is constrained,
> > > - * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> > > - * min_seq behind.
> > > + * stored in min_seq[] separately for anon and file types so that they can be
> > > + * incremented independently. Ideally min_seq[] are kept in sync when both anon
> > > + * and file types are evictable. However, to adapt to situations like extreme
> > > + * swappiness, they are allowed to be out of sync by at most
> > > + * MAX_NR_GENS-MIN_NR_GENS-1.
> > >   *
> > >   * The number of pages in each generation is eventually consistent and therefore
> > >   * can be transiently negative when reset_batch_size() is pending.
> > > @@ -446,8 +445,8 @@ struct lru_gen_folio {
> > >         unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
> > > [...]
> > > @@ -4533,7 +4548,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> > >  {
> > >         int i;
> > >         int type;
> > > -       int scanned;
> > >         int tier = -1;
> > >         DEFINE_MIN_SEQ(lruvec);
> > >
> > > @@ -4558,21 +4572,23 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> > >         else
> > >                 type = get_type_to_scan(lruvec, swappiness, &tier);
> > >
> > > -       for (i = !swappiness; i < ANON_AND_FILE; i++) {
> > > +       for_each_evictable_type(i, swappiness) {
> >
> > Thanks for working on solving the reported issues, but I have one
> > concern about this for_each_evictable_type macro and its usage here.
> >
> > It basically forbids eviction of file pages with "swappiness == 200",
> > even for global pressure, which is quite a change.
> >
> > For both the active/inactive LRU and MGLRU, max swappiness used to
> > make the kernel try to reclaim anon as much as possible while still
> > falling back to file eviction. Forbidding file eviction may cause an
> > unsolvable OOM: unlike anon pages, killing a process won't necessarily
> > release file pages, so the system could hang easily.
> >
> > Existing systems with swappiness == 200 that were running fine before
> > may also hit OOM very quickly.
>
> Do you know of anyone who actually uses 200? I only use 200 for
> testing, but I can use 201 instead, since the debugfs interface isn't
> limited to [0, 200].
>

Thanks for the updated patch.

Yes, I've seen some users using 200, especially with ZRAM/ZSWAP, so
the kernel prefers to keep page cache in memory when under pressure.
We also use 200 for some workloads.

We have an internal patch similar to your update that allows using 201
for proactive reclaim, so proactive reclaim only compresses pages in
RAM, avoiding the increased IO caused by page cache misses; it has
worked very well.
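
With the squash above, the debugfs interface accepts swappiness up to
MAX_SWAPPINESS + 1, so that kind of anon-only proactive reclaim can be
driven roughly as in the sketch below. It assumes the command format
documented in Documentation/admin-guide/mm/multigen_lru.rst, i.e.
"- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]"; the memcg
ID, node and generation numbers are hypothetical and would have to be
read from /sys/kernel/debug/lru_gen first, with debugfs mounted and
CONFIG_LRU_GEN enabled.

/*
 * Sketch only: ask for eviction up to a hypothetical generation of a
 * hypothetical memcg (ID 3) on node 0 with swappiness 201, so only anon
 * is considered, e.g. compressed into ZRAM/ZSWAP instead of dropping
 * page cache; reclaim at most 4096 pages.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/lru_gen", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	fprintf(f, "- 3 0 20 201 4096\n");

	/* the kernel validates the command when the write is flushed */
	if (fclose(f)) {
		perror("fclose");
		return 1;
	}

	return 0;
}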


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-01-15 17:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-31  4:35 [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Yu Zhao
2024-12-31  4:35 ` [PATCH mm-unstable v4 1/7] mm/mglru: clean up workingset Yu Zhao
2024-12-31  4:35 ` [PATCH mm-unstable v4 2/7] mm/mglru: optimize deactivation Yu Zhao
2024-12-31  4:35 ` [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback Yu Zhao
2025-01-07 17:14   ` Kairui Song
2025-01-13  6:51     ` Yu Zhao
2025-01-15 17:56       ` Kairui Song
2024-12-31  4:35 ` [PATCH mm-unstable v4 4/7] mm/mglru: rework type selection Yu Zhao
2024-12-31  4:35 ` [PATCH mm-unstable v4 5/7] mm/mglru: rework refault detection Yu Zhao
2024-12-31  4:35 ` [PATCH mm-unstable v4 6/7] mm/mglru: rework workingset protection Yu Zhao
2024-12-31  4:35 ` [PATCH mm-unstable v4 7/7] mm/mglru: fix PTE-mapped large folios Yu Zhao
2025-01-03  0:03 ` [PATCH mm-unstable v4 0/7] mm/mglru: performance optimizations Andrew Morton
2025-01-04  0:59   ` chenridong
2025-01-04  2:21     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox