Re: [PATCH v6] zswap: replace RB tree with xarray

From: Johannes Weiner <hannes@cmpxchg.org>
To: Chris Li <chrisl@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Yosry Ahmed <yosryahmed@google.com>,
	Nhat Pham <nphamcs@gmail.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	Barry Song <v-songbaohua@oppo.com>,
	Barry Song <baohua@kernel.org>,
	Chengming Zhou <chengming.zhou@linux.dev>
Subject: Re: [PATCH v6] zswap: replace RB tree with xarray
Date: Tue, 12 Mar 2024 14:43:29 -0400
Message-ID: <20240312184329.GA3501@cmpxchg.org>
In-Reply-To: <20240312-zswap-xarray-v6-1-1b82027d7082@kernel.org>

On Tue, Mar 12, 2024 at 10:31:12AM -0700, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
> 
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
> 
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
> 
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase and xa_store
> directly. That saves one tree lookup as well.
> 
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
> 
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
> 
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled,
> one 50GB swapfile, 24 HT core, 32 jobs)
> 
>            mm-9a0181a3710eb    xarray v5
> user       3532.385            3535.658
> sys        536.231             530.083
> real       200.431             200.176

This is a great improvement, both code- and complexity-wise.

I have a few questions and comments below:

What kernel version is this based on? It doesn't apply to
mm-everything, and I can't find 9a0181a3710eb anywhere.

> @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
>  insert_entry:
>  	entry->swpentry = swp;
>  	entry->objcg = objcg;
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, entry->length);
> -		/* Account before objcg ref is moved to tree */
> -		count_objcg_event(objcg, ZSWPOUT);
> -	}
>  
> -	/* map */
> -	spin_lock(&tree->lock);
>  	/*
>  	 * The folio may have been dirtied again, invalidate the
>  	 * possibly stale entry before inserting the new entry.
>  	 */

The comment is now somewhat stale and somewhat out of place. It should
be above that `if (old)` part... See below.

> -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> -		zswap_invalidate_entry(tree, dupentry);
> -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> +	old = xa_store(tree, offset, entry, GFP_KERNEL);
> +	if (xa_is_err(old)) {
> +		int err = xa_err(old);
> +		if (err == -ENOMEM)
> +			zswap_reject_alloc_fail++;
> +		else
> +			WARN_ONCE(err, "%s: xa_store failed: %d\n",
> +				  __func__, err);
> +		goto store_failed;

No need to complicate it. If we have a bug there, an incorrect fail
stat bump is the least of our concerns. Also, no need for __func__
since that information is included in the WARN:

	if (xa_is_err(old)) {
		int err = xa_err(old);

		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}

I think here is where that comment above should go:

	/*
	 * We may have had an existing entry that became stale when
	 * the folio was redirtied and now the new version is being
	 * swapped out. Get rid of the old.
	 */
> +	if (old)
> +		zswap_entry_free(old);
> +
> +	if (objcg) {
> +		obj_cgroup_charge_zswap(objcg, entry->length);
> +		/* Account before objcg ref is moved to tree */
> +		count_objcg_event(objcg, ZSWPOUT);
>  	}
> +
>  	if (entry->length) {
>  		INIT_LIST_HEAD(&entry->lru);
>  		zswap_lru_add(&zswap.list_lru, entry);
>  		atomic_inc(&zswap.nr_stored);
>  	}
> -	spin_unlock(&tree->lock);

We previously relied on the tree lock to finish initializing the entry
while it's already in the tree. Now we rely on something else:

	1. Concurrent stores and invalidations are excluded by folio lock.

	2. Writeback is excluded by the entry not being on the LRU yet.
	   The publishing order matters to prevent writeback from seeing
	   an incoherent entry.

I think this deserves a comment.
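
To illustrate the ordering argument, here is a minimal userspace
sketch. It is not zswap code: toy_entry, toy_tree, toy_store() and the
other toy_* names are made up for this example, the "tree" is a plain
array standing in for the xarray, and it is single-threaded, so it only
shows the order of the steps, not the locking itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's types or functions. */
struct toy_entry {
	unsigned int length;	/* 0 models a same-filled page */
	bool initialized;	/* set once charging/accounting is done */
	bool on_lru;		/* writeback only finds entries with this set */
};

#define TOY_SLOTS 8
static struct toy_entry *toy_tree[TOY_SLOTS];

/*
 * Models the xa_store() behavior the patch relies on: install the new
 * entry and hand back whatever was stored at that index before.
 * The caller must keep offset below TOY_SLOTS.
 */
static struct toy_entry *toy_store(size_t offset, struct toy_entry *entry)
{
	struct toy_entry *old = toy_tree[offset];

	toy_tree[offset] = entry;
	return old;
}

/* Models the writeback side, which only considers entries on the LRU. */
static bool toy_writeback_can_see(const struct toy_entry *entry)
{
	return entry->on_lru;
}

/* Store path in the order discussed above. Returns the replaced entry. */
static struct toy_entry *toy_zswap_store(size_t offset, struct toy_entry *entry)
{
	/*
	 * Assumption: the caller holds the folio lock, so no concurrent
	 * store or invalidation can touch this offset.
	 */
	struct toy_entry *old = toy_store(offset, entry);

	/*
	 * The entry is in the tree but not on the LRU yet, so writeback
	 * cannot reach it while initialization finishes.
	 */
	entry->initialized = true;

	/* Publish to writeback last. */
	if (entry->length)
		entry->on_lru = true;

	return old;
}
```

The invariant to preserve is that on_lru never becomes true before
initialized does; reordering those two steps is what would let
writeback see an incoherent entry.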

>  	/* update stats */
>  	atomic_inc(&zswap_stored_pages);
> @@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
>  
>  	return true;
>  
> +store_failed:
> +	if (!entry->length) {
> +		atomic_dec(&zswap_same_filled_pages);
> +		goto freepage;
> +	}

It'd be good to avoid the nested goto. Why not make the pool
operations conditional on entry->length instead:

store_failed:
	if (!entry->length) {
		atomic_dec(&zswap_same_filled_pages);
	} else {
		zpool_free(zswap_find_zpool(...));
put_pool:
		zswap_pool_put(entry->pool);
	}
freepage:

Not super pretty either, but it's a linear flow at least.
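
To double-check which cleanup steps each entry point ends up running
with that layout, here is a small userspace model of the flow. The
counters in struct unwind_log and the fail_point names are hypothetical
stand-ins, and which failure jumps to which label is my assumption
about the surrounding function, not something taken from the patch.

```c
/* Hypothetical counters standing in for the real cleanup calls. */
struct unwind_log {
	int same_filled_dec;	/* models atomic_dec(&zswap_same_filled_pages) */
	int zpool_freed;	/* models zpool_free() */
	int pool_put;		/* models zswap_pool_put() */
	int entry_freed;	/* models whatever follows freepage: */
};

/* Hypothetical failure sites, one per label. */
enum fail_point { FAIL_NO_POOL, FAIL_COMPRESS, FAIL_XA_STORE };

static struct unwind_log unwind(enum fail_point where, unsigned int length)
{
	struct unwind_log log = { 0 };

	switch (where) {
	case FAIL_NO_POOL:
		goto freepage;
	case FAIL_COMPRESS:
		goto put_pool;
	case FAIL_XA_STORE:
		goto store_failed;
	}

store_failed:
	if (!length) {
		/* Same-filled page: there is no pool or zpool handle to release. */
		log.same_filled_dec++;
	} else {
		log.zpool_freed++;
put_pool:
		/* Jumping into the else block is legal C; no declarations are skipped. */
		log.pool_put++;
	}
freepage:
	log.entry_freed++;
	return log;
}
```

A failed store of a same-filled entry only drops the counter and frees
the entry, a failed store of a compressed entry releases the zpool
handle and the pool reference first, and a compression failure enters
at put_pool without touching the zpool.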
