From: Chris Li <chrisl@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Yosry Ahmed <yosryahmed@google.com>,
Nhat Pham <nphamcs@gmail.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Chengming Zhou <zhouchengming@bytedance.com>,
Barry Song <v-songbaohua@oppo.com>,
Barry Song <baohua@kernel.org>,
Chengming Zhou <chengming.zhou@linux.dev>
Subject: Re: [PATCH v6] zswap: replace RB tree with xarray
Date: Wed, 13 Mar 2024 16:24:34 -0700
Message-ID: <CAF8kJuM8YAwnNA4LJqwuM1mu9+UaBek0_K+jG4JjUmu6RRqSNg@mail.gmail.com>
In-Reply-To: <20240312184329.GA3501@cmpxchg.org>
Hi Johannes,
On Tue, Mar 12, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 12, 2024 at 10:31:12AM -0700, Chris Li wrote:
> > A very deep RB tree requires rebalancing at times, which
> > contributes to zswap fault latencies. An xarray does not
> > need to perform tree rebalancing. Replacing the RB tree
> > with an xarray can yield a small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will shrink a bit from removing the
> > RB node, which has two pointers and a color field. The
> > xarray stores the pointer in the xarray tree rather than
> > in the zswap_entry; every entry has one pointer from the
> > xarray tree. Overall, switching to the xarray should save
> > some memory if the swap entries are densely packed.
> >
> > Notice that zswap_rb_search and zswap_rb_insert are always
> > followed by zswap_rb_erase. Use xa_erase and xa_store
> > directly, which saves one tree lookup as well.
> >
> > Remove zswap_invalidate_entry, since there is no longer a
> > need to call zswap_rb_erase; use zswap_entry_free instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has transferred to the xarray lock.
> >
> > Run the kernel build test 10 times for each version and
> > average the results (memory.max=2GB, zswap shrinker and
> > writeback enabled, one 50GB swapfile, 24 HT cores, 32 jobs):
> >
> > mm-9a0181a3710eb xarray v5
> > user 3532.385 3535.658
> > sys 536.231 530.083
> > real 200.431 200.176
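To illustrate the saved tree lookup mentioned in the commit message:
on the load and invalidate paths, the old search-then-erase pair
under the tree spinlock collapses into a single call. A sketch, using
the tree and offset naming from this patch:

	/*
	 * Look up and remove the entry in one tree walk. xa_erase()
	 * takes the xarray lock internally and returns the removed
	 * entry, or NULL if nothing was stored at that offset.
	 */
	entry = xa_erase(tree, offset);
	if (!entry)
		return false;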
>
> This is a great improvement, code- and complexity-wise.
Thanks!
>
> I have a few questions and comments below:
>
> What kernel version is this based on? It doesn't apply to
> mm-everything, and I can't find 9a0181a3710eb anywhere.
It is based on an old version of mm-unstable. I can try to rebase
onto mm-everything or a later mm-unstable.
>
> > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> > insert_entry:
> > entry->swpentry = swp;
> > entry->objcg = objcg;
> > - if (objcg) {
> > - obj_cgroup_charge_zswap(objcg, entry->length);
> > - /* Account before objcg ref is moved to tree */
> > - count_objcg_event(objcg, ZSWPOUT);
> > - }
> >
> > - /* map */
> > - spin_lock(&tree->lock);
> > /*
> > * The folio may have been dirtied again, invalidate the
> > * possibly stale entry before inserting the new entry.
> > */
>
> The comment is now somewhat stale and somewhat out of place. It should
> be above that `if (old)` part... See below.
Ack.
>
> > - if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > - zswap_invalidate_entry(tree, dupentry);
> > - WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > + old = xa_store(tree, offset, entry, GFP_KERNEL);
> > + if (xa_is_err(old)) {
> > + int err = xa_err(old);
> > + if (err == -ENOMEM)
> > + zswap_reject_alloc_fail++;
> > + else
> > + WARN_ONCE(err, "%s: xa_store failed: %d\n",
> > + __func__, err);
> > + goto store_failed;
>
> No need to complicate it. If we have a bug there, an incorrect fail
> stat bump is the least of our concerns. Also, no need for __func__
> since that information is included in the WARN:
>
> if (xa_is_err(old)) {
> WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> zswap_reject_alloc_fail++;
> goto store_failed;
> }
Ah, I see. Thanks for the simplification.
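Concretely, with the err declaration your snippet assumes spelled
out, the new version would read:

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		int err = xa_err(old);

		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}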
>
> I think here is where that comment above should go:
Ack.
>
> /*
> * We may have had an existing entry that became stale when
> * the folio was redirtied and now the new version is being
> * swapped out. Get rid of the old.
> */
> > + if (old)
> > + zswap_entry_free(old);
> > +
> > + if (objcg) {
> > + obj_cgroup_charge_zswap(objcg, entry->length);
> > + /* Account before objcg ref is moved to tree */
> > + count_objcg_event(objcg, ZSWPOUT);
> > }
> > +
> > if (entry->length) {
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap.list_lru, entry);
> > atomic_inc(&zswap.nr_stored);
> > }
> > - spin_unlock(&tree->lock);
>
> We previously relied on the tree lock to finish initializing the entry
> while it's already in the tree. Now we rely on something else:
>
> 1. Concurrent stores and invalidations are excluded by folio lock.
>
> 2. Writeback is excluded by the entry not being on the LRU yet.
> The publishing order matters to prevent writeback from seeing
> an incoherent entry.
>
> I think this deserves a comment.
I will add your points 1 and 2 as a comment block. Thanks for the
suggestion.
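Roughly like this, right above the LRU publishing (a sketch
paraphrasing your two points):

	/*
	 * The entry is visible in the tree before it is fully
	 * initialized. That is safe because:
	 *
	 * 1. Concurrent stores and invalidations are excluded by
	 *    the folio lock.
	 *
	 * 2. Writeback is excluded by the entry not being on the
	 *    LRU yet. The publishing order matters to prevent
	 *    writeback from seeing an incoherent entry.
	 */
	if (entry->length) {
		INIT_LIST_HEAD(&entry->lru);
		zswap_lru_add(&zswap.list_lru, entry);
		atomic_inc(&zswap.nr_stored);
	}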
>
> > /* update stats */
> > atomic_inc(&zswap_stored_pages);
> > @@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
> >
> > return true;
> >
> > +store_failed:
> > + if (!entry->length) {
> > + atomic_dec(&zswap_same_filled_pages);
> > + goto freepage;
> > + }
>
> It'd be good to avoid the nested goto. Why not make the pool
> operations conditional on entry->length instead:
>
> store_failed:
> if (!entry->length)
> atomic_dec(&zswap_same_filled_pages);
> else {
> zpool_free(zswap_find_zpool(...));
> put_pool:
> zswap_pool_put(entry->pool);
> }
> freepage:
Sure, I had an internal version exactly like that. I later changed
it again to get rid of the else. I can use your version as well.
>
> Not super pretty either, but it's a linear flow at least.
Thanks for your suggestions. I will send out a new version.
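Concretely, the error path would then read roughly as below. A
sketch: I am assuming the usual zswap_find_zpool(entry) and
entry->handle pair for the zpool_free() arguments elided in your
snippet:

store_failed:
	if (!entry->length) {
		atomic_dec(&zswap_same_filled_pages);
	} else {
		/* Undo the zpool allocation and drop the pool ref. */
		zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
		zswap_pool_put(entry->pool);
	}
freepage:
	zswap_entry_cache_free(entry);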
Chris