From: Nhat Pham <nphamcs@gmail.com>
Date: Fri, 15 Aug 2025 16:08:50 -0700
Subject: Re: [PATCH v2] mm/zswap: store
Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner, Yosry Ahmed, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Takero Funaki, Hugh Dickins
References: <20250813182024.7250-1-sj@kernel.org>
On Fri, Aug 15, 2025 at 3:29 PM Chris Li wrote:
>
> On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park wrote:
> >
> > My RFC v1[1] was using a kind of such approach.  I changed it to use zpool
> > starting from RFC v2, following the suggestions from Nhat, Johannes and
> > Joshua.  It was suggested to use zpool to reuse its support of migration,
> > highmem handling, and stats.  And I agree with their suggestions.  David
> > also raised a question on RFC v2, about the movability behavior changes[2].
>
> Thanks for the pointer, that is an interesting read.
>
> > I agree we will get more metadata overhead.  Nonetheless, I argue it is not a
>
> Keep in mind that it is more than just metadata. There is another hidden
> overhead of using zpool to store your full page compared to directly
> using the page: when the page is read, the zpool backend most likely
> does not hold it at a page-aligned position, so that zpool page cannot
> be freed immediately. Those fragmented zpool pages need to wait for
> compaction before they can be freed.

It's been a while since I last worked on zsmalloc, but IIRC zsmalloc
handles these PAGE_SIZE-sized objects by just handing out a full page.
I skimmed through the zsmalloc code, and it seems like for this
size_class, each zspage contains just one base page?

(Check out the logic around ZsHugePage and
calculate_zspage_chain_size() in mm/zsmalloc.c. Confusing name, I
know. There is a little toy model of this further down.)

So we won't need compaction to get rid of these buffers for
incompressible pages at free time. There might still be some extra
overhead (metadata), but at least I assume we can free these pages
easily?

> > big problem and can be mitigated in writeback-enabled environments.  Hence
> > this feature is disabled in the writeback-disabled case.  The next paragraph
> > of the original cover letter explains my points about this.
> >
> > >
> > > The pro is that, on the page fault, there is no need to allocate a
> > > new page. You can just move the uncompressed_page into the swap_cache.
> > > You save one memcpy on the page fault as well. That will make the
> > > incompressible page fault behave very similarly to a minor page fault.
> >
> > I agree this can save one memcpy, and that's a cool idea!  Nonetheless, this
> > patch is not making the [z]swap-in overhead worse.  Rather, this patch also
>
> We might still want to evaluate the loss/gain vs storing the
> incompressible pages in the swap cache. Google actually has an internal
> patch to store the incompressible pages in the swap cache and move them
> out of the LRU to another already existing list. I can clean it up a
> bit and send it to the list for comparison. This approach will not mess
> with the zswap LRU because the incompressible data is not moving into
> zswap. There is no page fault to handle the incompressible pages.

We can always iterate to get the best of all worlds :) One objective at
a time.
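Re the zsmalloc point above, here is the toy model I promised: a tiny
userspace program (my own simplification under my own assumptions, not
the kernel code) of the chain-sizing idea behind
calculate_zspage_chain_size() -- pick the chain length that wastes the
least space. For a PAGE_SIZE class, the chain collapses to one page
holding exactly one object:

    /* Toy model of zsmalloc's zspage chain sizing; not mm/zsmalloc.c. */
    #include <stdio.h>

    #define PAGE_SIZE            4096
    #define MAX_PAGES_PER_ZSPAGE 4

    /* Pick the chain length (pages per zspage) with the least waste. */
    static int chain_size_for(int class_size)
    {
            int best_chain = 1, min_waste = PAGE_SIZE;

            for (int i = 1; i <= MAX_PAGES_PER_ZSPAGE; i++) {
                    int waste = (i * PAGE_SIZE) % class_size;

                    if (waste < min_waste) {
                            min_waste = waste;
                            best_chain = i;
                    }
            }
            return best_chain;
    }

    int main(void)
    {
            int sizes[] = { 2048, 3264, PAGE_SIZE };

            for (int i = 0; i < 3; i++) {
                    int chain = chain_size_for(sizes[i]);

                    printf("class %4d: %d page(s), %d object(s) per zspage\n",
                           sizes[i], chain, chain * PAGE_SIZE / sizes[i]);
            }
            return 0;
    }

For the 4096 class this prints "1 page(s), 1 object(s) per zspage",
i.e. the whole page goes back to the allocator as soon as the object is
freed -- no compaction involved.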
BTW, may I ask - how/when do we move the incompressible pages out of
the "incompressible LRU"? I believe there needs to be some sort of
scanning mechanism to detect dirtying, right?

> > slightly improve it for the incompressible pages case by skipping the
> > decompression.  Also, if we take this way, I think we should also need to
> > make a good answer to the zswapped-page migrations etc.
>
> Yes, if we keep the store-page-in-zswap approach, we might have to
> optimize inside zsmalloc to handle full-page storing better.

I believe it already does - check my response above. But I can be wrong.

> > So, if we agree on my justification about the metadata overhead, I think
> > this could be done as a followup work of this patch?
>
> Sure. I am most interested in getting the best overall solution. No
> objection to getting it now vs later.

Agree. We can always iterate on solutions. No big deal.

> >
> > >
> > > The con is that zpool now does not account for uncompressed pages.
> > > On second thought, that can be a con as well: the page is not
> > > compressed, so why should it be accounted as compressed pages?
> >
> > I agree it might look weird.  But, in my humble opinion, from the
> > perspective of thinking of it as zswap backend storage, and by thinking of
> > the definition of compression in a flexible way, this may be understandable
> > and not cause real user problems?
>
> I think the real problem is hiding real errors behind non-errors like
> data not being compressible. If you want to store uncompressed data in
> the zswap layer by design, that is fine. We just need to reason about
> the trade-off vs other options like storing in the swap cache etc.
>
> > [...]
> > > >  mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 34 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 3c0fd8a13718..0fb940d03268 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
> > > >  static u64 zswap_reject_reclaim_fail;
> > > >  /* Store failed due to compression algorithm failure */
> > > >  static u64 zswap_reject_compress_fail;
> > > > +/* Compression into a size smaller than PAGE_SIZE failed */
> > > > +static u64 zswap_compress_fail;
> > > >  /* Compressed page was too big for the allocator to (optimally) store */
> > > >  static u64 zswap_reject_compress_poor;
> > > >  /* Load or writeback failed due to decompression failure */
> > > > @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > >  	 */
> > > >  	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > > >  	dlen = acomp_ctx->req->dlen;
> > > > -	if (comp_ret)
> > > > -		goto unlock;
> > > > +
> > > > +	/*
> > > > +	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > > > +	 * save the content as is without a compression, to keep the LRU order
> > > > +	 * of writebacks.  If writeback is disabled, reject the page since it
> > > > +	 * only adds metadata overhead.  swap_writeout() will put the page back
> > > > +	 * to the active LRU list in the case.
> > > > +	 */
> > > > +	if (comp_ret || dlen >= PAGE_SIZE) {
> > > > +		zswap_compress_fail++;
> > >
> > > I feel that mixing the comp_ret and dlen size tests here complicates
> > > the nested if statement and the behavior as well.
> > > One behavior change is that if comp_ret != 0, it means the
> > > compression crypto engine has some error, e.g. some bad chip that
> > > always fails the compression. That is a different error case than
> > > when the compression was successful but the compression ratio is not
> > > good. In this case the patch makes the page be stored in zpool on a
> > > crypto engine failure, which does not help.
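FWIW, an untested sketch (mine, not from the patch) of how the two
failure modes could be counted separately while keeping the
store-as-is behavior; zswap_compress_engine_fail is a hypothetical new
counter:

	if (comp_ret)
		/* crypto engine error: rare, worth its own counter */
		zswap_compress_engine_fail++;

	if (comp_ret || dlen >= PAGE_SIZE) {
		/* could not shrink the page below PAGE_SIZE */
		zswap_compress_fail++;
		if (!mem_cgroup_zswap_writeback_enabled(
				folio_memcg(page_folio(page)))) {
			comp_ret = comp_ret ? comp_ret : -EINVAL;
			goto unlock;
		}
		/* store the page content as-is */
		dlen = PAGE_SIZE;
		dst = kmap_local_page(page);
	}

That keeps SJ's LRU-ordering behavior for both failure modes, while
still letting us tell them apart in the stats.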
> >
> > I agree the code has room to improve, in terms of readability and keeping
> > fine-grained observability.
> >
> > >
> > > Is there any reason you want to store the page in zpool when the
> > > compression engine (not the ratio) fails?
> >
> > The main problem this patch tries to solve is the LRU order corruption.  In
> > any case, I want to keep the order if possible.
>
> In that case, does it mean that in order to keep the LRU, you never
> want to write from zswap to a real backend device? That will make
> zswap behave more like a standalone tier.
>
> > >
> > > What do you say about the following alternative? It keeps the
> > > original behavior if the compression engine fails.
> > >
> > > if (comp_ret) {
> > >         zswap_compress_engine_fail++;
> >
> > I agree this counter can be useful.
>
> Ack.
>
> > >         goto unlock;
> >
> > This will make the page go directly to the underlying slower swap device,
> > and hence cause LRU order inversion.  So unfortunately I have to say this is
> > not the behavior I want.
>
> Ack, that is a design choice. I understand now.
>
> > I agree engine failure is not frequent, so this behavior might not really be
> > a problem to me.  But still, I don't see a reason to let the page go
> > directly to swap and make the LRU order unexpected.
> >
> > > }
> > >
> > > if (dlen >= PAGE_SIZE) {
> > >         zswap_compress_fail++;
> >
> > I define compress failure here as a failure of the attempt to compress the
> > given page's content into a size smaller than PAGE_SIZE.  It is a superset
> > including both "ratio" failure and "engine" failure.  Hence I think
> > zswap_compress_fail should be increased even in the upper case.
>
> I slept on it a bit. Now I think we should make this a counter of
> how many uncompressed pages are stored in zswap, preferably as a
> per-memcg counter.

Actually, yeah, I asked about this counter in a review of an earlier
version as well, then I completely forgot about it :)

> I saw that you implemented it as a counter in your v1. Does zsmalloc
> already track this information in the zspool class? Having this per

Kinda sorta. If we build the kernel with CONFIG_ZSMALLOC_STAT, we can
get the number of objects in each size_class. Each time we read, I
believe we have to read every size class though. So it's kinda
annoying. Whereas here, we can just read an atomic counter? :)

> cgroup information we can evaluate how much zswap is saving. Some
> jobs like databases might store a lot of hash values and encrypted
> data which are not compressible. In that case, we might just bypass the

Hmmm actually, this might be a good point. Lemme think about it a bit
more. (But again, worst case scenario, we can send a follow-up patch
to add these counters.)

> zswap tier as a whole. The overall compression ratio will reflect
> that as well. Having the separate per-memcg counter is kind of useful
> as well.
>
> > We can discuss re-defining and re-naming what each stat means, of course,
> > if you want.  But I think even if we keep the current definitions and
> > names, we can still get the ratio failures if we add your nicely suggested
> > zswap_compress_engine_fail, as
> > 'zswap_compress_fail - zswap_compress_engine_fail', so we might not really
> > need re-definition and re-naming?
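Or, instead of deriving it by subtraction, we could track the resident
count directly -- a hypothetical sketch (the counter name is made up,
following the existing zswap_stored_pages pattern):

	/* Pages currently stored as-is (uncompressed) in zswap. */
	static atomic_long_t zswap_stored_incompressible =
		ATOMIC_LONG_INIT(0);

	/* when a page is stored uncompressed: */
	atomic_long_inc(&zswap_stored_incompressible);

	/* when such an entry is freed, invalidated, or written back: */
	atomic_long_dec(&zswap_stored_incompressible);

A gauge like this answers "how much incompressible data is sitting in
zswap right now" and is much cheaper to read than walking zsmalloc's
size classes. A per-memcg variant would need extra plumbing in the
memcg stats.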
>
> I am actually less interested in the absolute failure number, which
> keeps increasing, and more in how much incompressible data is stored
> in zswap. That counter + the number of engine errors should be
> perfect.
>
> > >         if (!mem_cgroup_zswap_writeback_enabled(
> > >                         folio_memcg(page_folio(page)))) {
> > >                 comp_ret = -EINVAL;
> > >                 goto unlock;
> > >         }
> > >         dlen = PAGE_SIZE;
> > >         dst = kmap_local_page(page);
> > > }
> > >
> > > Overall I feel this alternative is simpler and easier to read.
> >
> > I agree this code is nice to read.  Nonetheless I have to say the behavior
> > is somewhat different from what I want.
>
> Even if you keep the current behavior, you can invert the test
> condition and then remove the "else + goto", similar to the above.
> That will make your code shorter and flatter. I will need to think
> about whether we can assign the return value less often.
>
> Another point I would like to make is that you currently use the page
> size as the cut-off threshold. The ideal threshold might be something
> slightly smaller than the page size. The reason is that zsmalloc
> stores objects in fixed-size chunks. If your page is close to full,
> the data will be stored in the same class as a full page. You are not
> gaining anything from zsmalloc if the stored page's compression ratio
> is 95%; that 5% will be wasted in zsmalloc class fragmentation anyway.
> So the trade-off is: decompress 95% of a page vs memcpy 100% of a
> page. It is likely that the 100% memcpy is faster. I don't know the
> exact ideal cut-off threshold. If I had to guess, I would guess below
> 90%.
>
> > > I can be biased towards my own code :-).
> > > I think we should treat the compression engine failure separately from
> > > the compression rate failure. The engine failure is rare and we should
> > > know about it as a real error. The compression rate failure is normal.
> >
> > Again, I agree having the observability would be nice.  I can add a new
> > counter for that, like the below attached patch, for example.
>
> I would love that. Make that per-memcg if possible :-)
>
> > [1] https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org
> > [2] https://lore.kernel.org/761a2899-6fd9-4bfe-aeaf-23bce0baa0f1@redhat.com
>
> Thanks for the pointer, good read.
>
> Chris
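P.S. On the cut-off threshold point, I read the suggestion as something
like the following (hypothetical; the 90% figure is Chris's guess and
would need benchmarking):

	/*
	 * Treat "barely compressible" pages as incompressible too:
	 * zsmalloc's size-class rounding eats most of the savings
	 * anyway, and a memcpy beats a decompression.
	 */
	#define ZSWAP_STORE_AS_IS_THRESHOLD	(PAGE_SIZE * 9 / 10)

	if (comp_ret || dlen >= ZSWAP_STORE_AS_IS_THRESHOLD) {
		/* ... store the page content as-is, as in the patch ... */
	}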