From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Yu Zhao <yuzhao@google.com>,
"T . J . Mercier" <tjmercier@google.com>,
stable@vger.kernel.org
Subject: [PATCH mm-unstable v1 3/4] mm/mglru: respect min_ttl_ms with memcgs
Date: Thu, 7 Dec 2023 23:14:06 -0700 [thread overview]
Message-ID: <20231208061407.2125867-3-yuzhao@google.com> (raw)
In-Reply-To: <20231208061407.2125867-1-yuzhao@google.com>
While investigating kswapd "consuming 100% CPU" [1] (also see
"mm/mglru: try to stop at high watermarks"), it was discovered that
the memcg LRU can breach the thrashing protection imposed by
min_ttl_ms.
Before the memcg LRU:
kswapd()
shrink_node_memcgs()
mem_cgroup_iter()
inc_max_seq() // always hit a different memcg
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
After the memcg LRU:
kswapd()
shrink_many()
restart:
iterate the memcg LRU:
inc_max_seq() // occasionally hit the same memcg
if raced with lru_gen_rotate_memcg():
goto restart
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
Specifically, when the restart happens in shrink_many(), it needs to
stick with the (memcg LRU) generation it began with. In other words,
it should neither re-read memcg_lru->seq nor age an lruvec of a
different generation. Otherwise it can hit the same memcg multiple
times without giving lru_gen_age_node() a chance to check the
timestamp of that memcg's oldest generation (against min_ttl_ms).
[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: stable@vger.kernel.org
---
include/linux/mmzone.h | 30 +++++++++++++++++-------------
mm/vmscan.c | 30 ++++++++++++++++--------------
2 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b23bc5390240..e3093ef9530f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -510,33 +510,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
* the old generation, is incremented when all its bins become empty.
*
* There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its
+ * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
* current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its
+ * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
* current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old
+ * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
* generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the
+ * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
* young generation, updates its "gen" to "young" and resets its "seg" to
* "default".
*
* The events that trigger the above operations are:
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim an memcg below low, which triggers
+ * 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim an memcg below reclaimable size threshold,
+ * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim an memcg below reclaimable size threshold,
+ * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG;
+ * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
+ * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
*
- * Note that memcg LRU only applies to global reclaim, and the round-robin
- * incrementing of their max_seq counters ensures the eventual fairness to all
- * eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * Notes:
+ * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
+ * of their max_seq counters ensures the eventual fairness to all eligible
+ * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * 2. There are only two valid generations: old (seq) and young (seq+1).
+ * MEMCG_NR_GENS is set to three so that when reading the generation counter
+ * locklessly, a stale value (seq-1) does not wraparound to young.
*/
-#define MEMCG_NR_GENS 2
+#define MEMCG_NR_GENS 3
#define MEMCG_NR_BINS 8
struct lru_gen_memcg {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10e964cd0efe..cac38e9cac86 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4117,6 +4117,9 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
else
VM_WARN_ON_ONCE(true);
+ WRITE_ONCE(lruvec->lrugen.seg, seg);
+ WRITE_ONCE(lruvec->lrugen.gen, new);
+
hlist_nulls_del_rcu(&lruvec->lrugen.list);
if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
@@ -4127,9 +4130,6 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
pgdat->memcg_lru.nr_memcgs[old]--;
pgdat->memcg_lru.nr_memcgs[new]++;
- lruvec->lrugen.gen = new;
- WRITE_ONCE(lruvec->lrugen.seg, seg);
-
if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
@@ -4152,11 +4152,11 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg)
gen = get_memcg_gen(pgdat->memcg_lru.seq);
+ lruvec->lrugen.gen = gen;
+
hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
pgdat->memcg_lru.nr_memcgs[gen]++;
- lruvec->lrugen.gen = gen;
-
spin_unlock_irq(&pgdat->memcg_lru.lock);
}
}
@@ -4663,7 +4663,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
DEFINE_MAX_SEQ(lruvec);
if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return 0;
+ return -1;
if (!should_run_aging(lruvec, max_seq, sc, can_swap, &nr_to_scan))
return nr_to_scan;
@@ -4738,7 +4738,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
- /* whether try_to_inc_max_seq() was successful */
+ /* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
@@ -4792,13 +4792,13 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
- const struct hlist_nulls_node *pos;
+ struct hlist_nulls_node *pos;
+ gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
op = 0;
memcg = NULL;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
rcu_read_lock();
@@ -4809,6 +4809,10 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
}
mem_cgroup_put(memcg);
+ memcg = NULL;
+
+ if (gen != READ_ONCE(lrugen->gen))
+ continue;
lruvec = container_of(lrugen, struct lruvec, lrugen);
memcg = lruvec_memcg(lruvec);
@@ -4893,16 +4897,14 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
return;
/*
- * Determine the initial priority based on ((total / MEMCG_NR_GENS) >>
- * priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, where the
- * estimated reclaimed_to_scanned_ratio = inactive / total.
+ * Determine the initial priority based on
+ * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+ * where reclaimed_to_scanned_ratio = inactive / total.
*/
reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
if (get_swappiness(lruvec, sc))
reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
- reclaimable /= MEMCG_NR_GENS;
-
/* round down reclaimable and round up sc->nr_to_reclaim */
priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
--
2.43.0.472.g3155946c3a-goog
next prev parent reply other threads:[~2023-12-08 6:14 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-08 6:14 [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache Yu Zhao
2023-12-08 6:14 ` [PATCH mm-unstable v1 2/4] mm/mglru: try to stop at high watermarks Yu Zhao
2023-12-08 11:00 ` Hillf Danton
2023-12-11 22:01 ` Yu Zhao
2023-12-08 6:14 ` Yu Zhao [this message]
2023-12-08 6:14 ` [PATCH mm-unstable v1 4/4] mm/mglru: reclaim offlined memcgs harder Yu Zhao
2023-12-08 8:24 ` [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache Kairui Song
2023-12-11 22:06 ` Yu Zhao
2023-12-12 6:52 ` Kairui Song
2023-12-13 3:02 ` Kairui Song
2023-12-13 7:59 ` Yu Zhao
2023-12-14 3:09 ` Yu Zhao
2023-12-14 18:37 ` Kairui Song
2023-12-14 23:51 ` Yu Zhao
2023-12-15 4:56 ` Yu Zhao
2023-12-18 18:05 ` Kairui Song
2023-12-19 3:21 ` Yu Zhao
2023-12-19 3:44 ` Yu Zhao
2023-12-19 18:58 ` Kairui Song
2023-12-20 6:38 ` Yu Zhao
2023-12-20 8:16 ` Yu Zhao
2023-12-20 8:24 ` Kairui Song
2023-12-25 6:30 ` Yu Zhao
2023-12-25 12:02 ` Kairui Song
2023-12-25 21:52 ` Yu Zhao
2023-12-25 22:00 ` Yu Zhao
2024-01-10 19:16 ` Kairui Song
2024-01-11 7:02 ` Yu Zhao
2023-12-17 18:31 ` Yu Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231208061407.2125867-3-yuzhao@google.com \
--to=yuzhao@google.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=stable@vger.kernel.org \
--cc=tjmercier@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox