linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	Nhat Pham <nphamcs@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chris Li <chriscli@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
Date: Mon, 29 Jan 2024 22:17:27 -0500	[thread overview]
Message-ID: <20240130031727.GA780575@cmpxchg.org> (raw)
In-Reply-To: <527bd543-97a5-4262-be73-6a5d21c2f896@bytedance.com>

On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >>  {
> >>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >>  	bool *encountered_page_in_swapcache = (bool *)arg;
> >> -	struct zswap_tree *tree;
> >> -	pgoff_t swpoffset;
> >> +	swp_entry_t swpentry;
> >>  	enum lru_status ret = LRU_REMOVED_RETRY;
> >>  	int writeback_result;
> >>  
> >> +	/*
> >> +	 * Rotate the entry to the tail before unlocking the LRU,
> >> +	 * so that in case of an invalidation race concurrent
> >> +	 * reclaimers don't waste their time on it.
> >> +	 *
> >> +	 * If writeback succeeds, or failure is due to the entry
> >> +	 * being invalidated by the swap subsystem, the invalidation
> >> +	 * will unlink and free it.
> >> +	 *
> >> +	 * Temporary failures, where the same entry should be tried
> >> +	 * again immediately, almost never happen for this shrinker.
> >> +	 * We don't do any trylocking; -ENOMEM comes closest,
> >> +	 * but that's extremely rare and doesn't happen spuriously
> >> +	 * either. Don't bother distinguishing this case.
> >> +	 *
> >> +	 * But since they do exist in theory, the entry cannot just
> >> +	 * be unlinked, or we could leak it. Hence, rotate.
> > 
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
> 
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)

I agree with the suggestion of a central point to document this.

How about something like this:

/*
 * As soon as we drop the LRU lock, the entry can be freed by
 * a concurrent invalidation. This means the following:
 *
 * 1. We extract the swp_entry_t to the stack, allowing
 *    zswap_writeback_entry() to pin the swap entry and
 *    then validate the zwap entry against that swap entry's
 *    tree using pointer value comparison. Only when that
 *    is successful can the entry be dereferenced.
 *
 * 2. Usually, objects are taken off the LRU for reclaim. In
 *    this case this isn't possible, because if reclaim fails
 *    for whatever reason, we have no means of knowing if the
 *    entry is alive to put it back on the LRU.
 *
 *    So rotate it before dropping the lock. If the entry is
 *    written back or invalidated, the free path will unlink
 *    it. For failures, rotation is the right thing as well.
 *
 *    Temporary failures, where the same entry should be tried
 *    again immediately, almost never happen for this shrinker.
 *    We don't do any trylocking; -ENOMEM comes closest,
 *    but that's extremely rare and doesn't happen spuriously
 *    either. Don't bother distinguishing this case.
 */

> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
> 
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
> 
> 	/*
> 	 * folio is locked, and the swapcache is now secured against
> 	 * concurrent swapping to and from the slot. Verify that the
> 	 * swap entry hasn't been invalidated and recycled behind our
> 	 * backs (our zswap_entry reference doesn't prevent that), to
> 	 * avoid overwriting a new swap folio with old compressed data.
> 	 */

The bit in () is now stale, since we're not even holding a ref ;)

Otherwise, a brief note that entry can not be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.

Can I ask a favor, though?

For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error prone to rebase. I'd like
to avoid that if at all possible.


  reply	other threads:[~2024-01-30  3:17 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-28 13:28 [PATCH v2 0/3] " Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
2024-01-30  0:09   ` Yosry Ahmed
2024-01-30  0:12     ` Nhat Pham
2024-01-30  0:58       ` Yosry Ahmed
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-30  0:22   ` Yosry Ahmed
2024-01-30  2:30     ` Chengming Zhou
2024-01-30  3:17       ` Johannes Weiner [this message]
2024-01-30  3:31         ` Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
2024-01-28 15:52   ` Johannes Weiner
2024-01-28 19:45     ` Nhat Pham
2024-01-30  0:34   ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240130031727.GA780575@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=chriscli@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=yosryahmed@google.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox