From: Nhat Pham <nphamcs@gmail.com>
To: Chris Li <chrisl@kernel.org>
Cc: SeongJae Park <sj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Chengming Zhou <chengming.zhou@linux.dev>,
	David Hildenbrand <david@redhat.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org,
	Takero Funaki <flintglass@gmail.com>,
	Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
Date: Fri, 15 Aug 2025 16:08:50 -0700
Message-ID: <CAKEwX=Mtu3KyNUv_oWFo9vNiWKkmbzMXmG3t=XgpVtG9C_v2mA@mail.gmail.com>
In-Reply-To: <CAF8kJuONDFj4NAksaR4j_WyDbNwNGYLmTe-o76rqU17La=nkOw@mail.gmail.com>

On Fri, Aug 15, 2025 at 3:29 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > My RFC v1[1] used that kind of approach.  I changed it to use zpool
> > starting from RFC v2, following the suggestions from Nhat, Johannes and Joshua.
> > It was suggested to use zpool to reuse its support for migration, highmem
> > handling, and stats.  And I agree with their suggestions.  David also raised a
> > question on RFC v2 about the movability behavior changes[2].
>
> Thanks for the pointer, that is an interesting read.
>
> > I agree we will get more metadata overhead.  Nonetheless, I argue it is not a
>
> Keep in mind that it is more than just metadata. There is another hidden
> overhead of using zpool to store a full page compared to using the
> page directly.  When the page is read, the zpool backend has most likely
> placed it at a position that is not page-aligned, so the backing zpool
> page cannot be freed immediately; those fragmented zpool pages will have
> to wait for compaction to be freed.

It's been a while since I last worked on zsmalloc, but IIRC zsmalloc
handles these PAGE_SIZE-sized objects by just handing out a full
page. I skimmed through the zsmalloc code, and it seems like for this
size_class, each zspage contains just one base page?

(check out the logic around ZsHugePage and
calculate_zspage_chain_size() in mm/zsmalloc.c. Confusing name, I
know).
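
From memory, the relevant chain-size logic looks roughly like the
following. This is a paraphrased sketch, not a verbatim copy of
mm/zsmalloc.c, so take it with a grain of salt:

    /*
     * Paraphrased sketch of the zspage chain sizing idea; the real
     * code in mm/zsmalloc.c may differ in detail.
     */
    static int calculate_zspage_chain_size(int class_size)
    {
            int i, min_waste = INT_MAX;
            int chain_size = 1;

            /*
             * PAGE_SIZE is a power of two, so the PAGE_SIZE class
             * chains exactly one base page: freeing the object frees
             * a whole page, leaving no fragment for compaction.
             */
            if (is_power_of_2(class_size))
                    return chain_size;

            for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
                    int waste = (i * PAGE_SIZE) % class_size;

                    if (waste < min_waste) {
                            min_waste = waste;
                            chain_size = i;
                    }
            }

            return chain_size;
    }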

So, at free time, we won't need compaction to get rid of these
buffers for incompressible pages. There might still be some extra
overhead (metadata), but at least I assume we can free these pages
easily?

>
> > big problem and can be mitigated in writeback-enabled environments.  Hence this
> > feature is disabled on writeback-disabled case.  Next paragraph on the original
> > cover letter explains my points about this.
> >
> > >
> > > The pros is that, on the page fault, there is no need to allocate a
> > > new page. You can just move the uncompressed_page into the swap_cache.
> > > You save one memcpy on the page fault as well. That will make the
> > > incompressible page fault behave very similar to the minor page fault.
> >
> > I agree this can save one memcpy, and that's a cool idea!  Nonetheless, this
> > patch is not making the [z]swap-in overhead worse.  Rather, this patch also
>
> We might still want to evaluate the loss/gain vs. storing the
> incompressible pages in the swap cache. Google actually has an internal
> patch that stores incompressible pages in the swap cache and moves them
> out of the LRU to another, already existing list. I can clean it up a
> bit and send it to the list for comparison. This approach does not mess
> with the zswap LRU because the incompressible data does not move into
> zswap. There is no page fault needed to handle the incompressible pages.

We can always iterate to get the best of all worlds :) One objective at a time.

BTW, may I ask: how/when do we move the incompressible pages out of
the "incompressible LRU"? I believe there needs to be some sort of
scanning mechanism to detect dirtying, right?

>
> > slightly improves it for the incompressible pages case by skipping the
> > decompression.  Also, if we take this route, I think we would also need
> > a good answer for the zswapped-page migrations etc.
>
> Yes, if we keep the store-the-page-in-zswap approach, we might have to
> optimize zsmalloc internals to handle full-page storing better.

I believe it already does - check my response above. But I could be wrong.

>
> > So, if we agree on my justification about the metadata overhead, I think this
> > could be done as a followup work of this patch?
>
> Sure. I am most interested in getting the best overall solution. No
> objection to getting it now vs. later.

Agree. We can always iterate on solutions. No big deal.

>
> >
> > >
> > > The con is that zpool then does not account for uncompressed pages.
> > > On second thought, that might not be a con at all: the page is not
> > > compressed, so why should it be accounted as compressed pages?
> >
> > I agree it might look weird.  But, in my humble opinion, from the
> > perspective of treating it as zswap backend storage, and by interpreting
> > the definition of compression in a flexible way, this may be
> > understandable and not cause real user problems?
>
> I think the real problem is hiding a real error behind non-errors like
> data not being compressible. If you want to store uncompressed data in
> the zswap layer by design, that is fine. We just need to reason about the
> trade-off vs. other options like storing in the swap cache etc.
>
> > [...]
> > > >  mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 34 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 3c0fd8a13718..0fb940d03268 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
> > > >  static u64 zswap_reject_reclaim_fail;
> > > >  /* Store failed due to compression algorithm failure */
> > > >  static u64 zswap_reject_compress_fail;
> > > > +/* Compression into a size of <PAGE_SIZE failed */
> > > > +static u64 zswap_compress_fail;
> > > >  /* Compressed page was too big for the allocator to (optimally) store */
> > > >  static u64 zswap_reject_compress_poor;
> > > >  /* Load or writeback failed due to decompression failure */
> > > > @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > >          */
> > > >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > > >         dlen = acomp_ctx->req->dlen;
> > > > -       if (comp_ret)
> > > > -               goto unlock;
> > > > +
> > > > +       /*
> > > > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > > > +        * save the content as is without a compression, to keep the LRU order
> > > > +        * of writebacks.  If writeback is disabled, reject the page since it
> > > > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > > > +        * to the active LRU list in the case.
> > > > +        */
> > > > +       if (comp_ret || dlen >= PAGE_SIZE) {
> > > > +               zswap_compress_fail++;
> > >
> > > I feel that mixing the comp_ret and dlen size tests here complicates
> > > the nested if statement and the behavior as well.
> > > One behavior change is that if comp_ret != 0, the compression
> > > crypto engine has some error, e.g. a bad chip that always fails the
> > > compression. That is a different error case from the compression
> > > succeeding with a poor compression ratio. In this case the patch
> > > stores the page in zpool on a crypto engine failure, which does
> > > not help.
> >
> > I agree the code has room to improve, in terms of readability and
> > keeping fine-grained observability.
> >
> > >
> > > Is there any reason you want to store the page in zpool when the
> > > compression engine (not the ratio) fails?
> >
> > The main problem this patch tries to solve is the LRU order corruption.  In any
> > case, I want to keep the order if possible.
>
> In that case, does it mean that in order to keep the LRU order, you never
> want to write from zswap to a real backend device? That will make
> zswap behave more like a standalone tier.
>
> >
> > >
> > > What do you say about the following alternative? It keeps the
> > > original behavior if the compression engine fails.
> > >
> > >      if (comp_ret) {
> > >           zswap_compress_engine_fail++;
> >
> > I agree this counter can be useful.
>
> Ack.
>
> >
> > >           goto unlock;
> >
> > This will make the page go directly to the underlying slower swap device,
> > and hence cause LRU order inversion.  So unfortunately I have to say this
> > is not the behavior I want.
>
> Ack, that is a design choice. I understand now.
>
> >
> > I agree engine failure is not frequent, so this behavior might not really
> > be a problem for me.  But still, I don't see a reason to let the page go
> > directly to swap and make the LRU order unexpected.
> >
> > >      }
> > >
> > >      if (dlen >= PAGE_SIZE) {
> > >         zswap_compress_fail++;
> >
> > I define compression failure here as a failure of the attempt to compress
> > the given page's content into a size smaller than PAGE_SIZE.  It is a
> > superset including both "ratio" failure and "engine" failure.  Hence I
> > think zswap_compress_fail should be increased even in the engine failure
> > case above.
>
> I slept on it a bit. Now I think we should make this a counter of
> how many uncompressed pages are stored in zswap. Preferably as a per
> memcg counter.

Actually, yeah, I asked about this counter in a review of an earlier
version as well, then I completely forgot about it :)


> I saw that you implemented it as a counter in your v1. Does zsmalloc
> already track this information in the zspool class? Having this per

Kinda sorta. If we build the kernel with CONFIG_ZSMALLOC_STAT, we can
get the number of objects in each size_class.

Each time we read, I believe we have to read every size class though.
So it's kinda annoying. Whereas here, we can just read an atomic
counter? :)
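
To illustrate what I mean, something as simple as this would do (a
rough sketch with made-up names, not actual zswap code):

    /* Hypothetical counter of pages zswap stored as-is. */
    static atomic_long_t zswap_stored_incompressible = ATOMIC_LONG_INIT(0);

    /* Bump it when storing a page uncompressed... */
    atomic_long_inc(&zswap_stored_incompressible);

    /* ...and drop it when the entry is freed or written back. */
    atomic_long_dec(&zswap_stored_incompressible);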

> cgroup information, we can evaluate how much zswap is saving. Some
> jobs like databases might store a lot of hash values and encrypted data,
> which are not compressible. In that case, we might just bypass the

Hmmm actually, this might be a good point. Lemme think about it a bit more.

(but again, worst case scenario, we can send a follow up patch to add
these counters).

> zswap tier as a whole. The overall compression ratio will reflect that
> as well. Having a separate per-memcg counter is kind of useful as
> well.
>
> >
> > We can discuss re-defining and re-naming what each stat means, of course,
> > if you want.  But I think even if we keep the current definitions and names,
> > we can still get the ratio failures, if we add your nicely suggested
> > zswap_compress_engine_fail, as
> > 'zswap_compress_fail - zswap_compress_engine_fail', so we might not really
> > need the re-definition and re-naming?
>
> I am actually less interested in the absolute failure number, which
> keeps increasing, and more in how much incompressible data is stored in
> zswap. That counter + the number of engine errors should be perfect.
>
> >
> > >         if (!mem_cgroup_zswap_writeback_enabled(
> > >                                       folio_memcg(page_folio(page)))) {
> > >               comp_ret = -EINVAL;
> > >               goto unlock;
> > >         }
> > >        dlen = PAGE_SIZE;
> > >        dst = kmap_local_page(page);
> > >     }
> > >
> > > Overall I feel this alternative is simpler and easier to read.
> >
> > I agree this code is nice to read.  Nonetheless I have to say the behavior is
> > somewhat different from what I want.
>
> Even if you keep the current behavior, you can invert the test
> condition and then remove the "else + goto", similar to the above.
> That will make your code shorter and flatter. I will need to think about
> whether we can assign the return value in fewer places.
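
(If I understand the suggestion correctly, the flattened flow, keeping
the current behavior, could look roughly like this untested sketch:

    if (comp_ret || dlen >= PAGE_SIZE) {
            zswap_compress_fail++;
            if (!mem_cgroup_zswap_writeback_enabled(
                            folio_memcg(page_folio(page)))) {
                    comp_ret = comp_ret ? comp_ret : -EINVAL;
                    goto unlock;
            }
            /* Store the page content as-is. */
            comp_ret = 0;
            dlen = PAGE_SIZE;
            dst = kmap_local_page(page);
    }

Just a sketch; the details may not match the final patch.)
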
>
> Another point I would like to make is that you currently set the cut-off
> threshold at page size. The ideal threshold might be something
> slightly smaller than page size. The reason is that zsmalloc has
> fixed-size chunks to store the data. If your compressed page is close to
> a full page, it will be stored in the same class as a full page. You are
> not gaining anything from zsmalloc if the stored page's compression ratio
> is 95%; that 5% will be wasted in zsmalloc class fragmentation anyway. So
> the trade-off is decompressing 95% of a page vs. memcpy'ing 100% of a
> page. It is likely that the 100% memcpy is faster. I don't know the exact
> ideal cut-off threshold. If I had to guess, I would guess below 90%.
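
To make that concrete, a 90% cut-off could look something like this
(the threshold value is hypothetical and would need measurement):

    /*
     * Hypothetical: reject compression results that save less than
     * ~10% of a page; the ideal cut-off is an open question.
     */
    #define ZSWAP_COMPRESS_THRESHOLD ((PAGE_SIZE * 9) / 10)

    if (comp_ret || dlen >= ZSWAP_COMPRESS_THRESHOLD) {
            /* store the page content as-is instead */
    }
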
>
> >
> > > I can
> > > be biased towards my own code :-).
> > > I think we should treat the compression engine failure separately from
> > > the compression rate failure. The engine failure is rare and we should
> > > know about it as a real error. The compression rate failure is normal.
> >
> > Again, I agree having the observability would be nice.  I can add a new
> > counter for that, like the attached patch below, for example.
>
> I would love that. Make that per memcg if possible :-)
>
> >
> > [1] https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org
> > [2] https://lore.kernel.org/761a2899-6fd9-4bfe-aeaf-23bce0baa0f1@redhat.com
>
> Thanks for the pointer, good read.
>
> Chris

