From: Chengming Zhou <zhouchengming@bytedance.com>
To: Nhat Pham <nphamcs@gmail.com>,
Yosry Ahmed <yosryahmed@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org, Nhat Pham <nphamcs@gmail.com>,
Chengming Zhou <zhouchengming@bytedance.com>,
linux-kernel@vger.kernel.org, Yosry Ahmed <yosryahmed@google.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: [PATCH v2 1/6] mm/zswap: add more comments in shrink_memcg_cb()
Date: Sun, 04 Feb 2024 03:05:59 +0000 [thread overview]
Message-ID: <20240201-b4-zswap-invalidate-entry-v2-1-99d4084260a0@bytedance.com> (raw)
In-Reply-To: <20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com>
Add more comments in shrink_memcg_cb() to describe the deref dance
which is implemented to fix race problem between lru writeback and
swapoff, and the reason why we rotate the entry at the beginning.
Also fix the stale comments in zswap_writeback_entry(), and add
more comments to state that we only deref the tree after we get
the swapcache reference.
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
mm/zswap.c | 43 ++++++++++++++++++++++++++-----------------
1 file changed, 26 insertions(+), 17 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 4aea03285532..735f1a6ef336 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1207,10 +1207,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
/*
* folio is locked, and the swapcache is now secured against
- * concurrent swapping to and from the slot. Verify that the
- * swap entry hasn't been invalidated and recycled behind our
- * backs (our zswap_entry reference doesn't prevent that), to
- * avoid overwriting a new swap folio with old compressed data.
+ * concurrent swapping to and from the slot, and concurrent
+ * swapoff so we can safely dereference the zswap tree here.
+ * Verify that the swap entry hasn't been invalidated and recycled
+ * behind our backs, to avoid overwriting a new swap folio with
+ * old compressed data. Only when this is successful can the entry
+ * be dereferenced.
*/
tree = swap_zswap_tree(swpentry);
spin_lock(&tree->lock);
@@ -1263,22 +1265,29 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
int writeback_result;
/*
- * Rotate the entry to the tail before unlocking the LRU,
- * so that in case of an invalidation race concurrent
- * reclaimers don't waste their time on it.
+ * As soon as we drop the LRU lock, the entry can be freed by
+ * a concurrent invalidation. This means the following:
*
- * If writeback succeeds, or failure is due to the entry
- * being invalidated by the swap subsystem, the invalidation
- * will unlink and free it.
+ * 1. We extract the swp_entry_t to the stack, allowing
+ * zswap_writeback_entry() to pin the swap entry and
+ * then validate the zwap entry against that swap entry's
+ * tree using pointer value comparison. Only when that
+ * is successful can the entry be dereferenced.
*
- * Temporary failures, where the same entry should be tried
- * again immediately, almost never happen for this shrinker.
- * We don't do any trylocking; -ENOMEM comes closest,
- * but that's extremely rare and doesn't happen spuriously
- * either. Don't bother distinguishing this case.
+ * 2. Usually, objects are taken off the LRU for reclaim. In
+ * this case this isn't possible, because if reclaim fails
+ * for whatever reason, we have no means of knowing if the
+ * entry is alive to put it back on the LRU.
*
- * But since they do exist in theory, the entry cannot just
- * be unlinked, or we could leak it. Hence, rotate.
+ * So rotate it before dropping the lock. If the entry is
+ * written back or invalidated, the free path will unlink
+ * it. For failures, rotation is the right thing as well.
+ *
+ * Temporary failures, where the same entry should be tried
+ * again immediately, almost never happen for this shrinker.
+ * We don't do any trylocking; -ENOMEM comes closest,
+ * but that's extremely rare and doesn't happen spuriously
+ * either. Don't bother distinguishing this case.
*/
list_move_tail(item, &l->list);
--
b4 0.10.1
next prev parent reply other threads:[~2024-02-04 3:06 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-04 3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
2024-02-04 3:05 ` Chengming Zhou [this message]
2024-02-04 3:06 ` [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
2024-02-05 21:20 ` Yosry Ahmed
2024-02-04 3:06 ` [PATCH v2 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
2024-02-04 3:06 ` [PATCH v2 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
2024-02-04 3:06 ` [PATCH v2 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
2024-02-04 3:06 ` [PATCH v2 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240201-b4-zswap-invalidate-entry-v2-1-99d4084260a0@bytedance.com \
--to=zhouchengming@bytedance.com \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox