From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-cxl@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
kernel-team@meta.com, longman@redhat.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, akpm@linux-foundation.org,
vbabka@suse.cz, surenb@google.com, mhocko@suse.com,
jackmanb@google.com, ziy@nvidia.com, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, rientjes@google.com,
shakeel.butt@linux.dev, chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
baohua@kernel.org, chengming.zhou@linux.dev,
roman.gushchin@linux.dev, muchun.song@linux.dev,
osalvador@suse.de, matthew.brost@intel.com,
joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com,
ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
Date: Mon, 12 Jan 2026 21:13:26 +0000
Message-ID: <4ftthovin57fi4blr2mardw4elwfsiv6vrkhrjqjsfvvuuugjj@uivjc5uzj5ys>
In-Reply-To: <aWF1uDdP75gOCGLm@gourry-fedora-PF4VCD3F>
On Fri, Jan 09, 2026 at 04:40:08PM -0500, Gregory Price wrote:
> On Fri, Jan 09, 2026 at 04:00:00PM +0000, Yosry Ahmed wrote:
> > On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
> >
> > If the memory is byte-addressable, using it as a second tier makes it
> > directly accessible without page faults, so the access latency is much
> > better than a swapped out page in zswap.
> >
> > Are there some HW limitations that allow a node to be used as a backend
> > for zswap but not a second tier?
> >
>
> Coming back around - presumably any compressed node capable of hosting a
> proper tier would be compatible with zswap, but you might have hardware
> which is sufficiently slow(er than dram, faster than storage) that using
> it as a proper tier may be less efficient than incurring faults.
>
> The standard I've been using is 500ns+ cacheline fetches, but this is
> somewhat arbitrary. Even 500ns might be better than accessing multi-us
> storage, but then when you add compression you might hit 600ns-1us.
>
> This is beside the point, and apologies for the wall of text below,
> feel free to skip this next section - writing out what hardware-specific
> details I can share for the sake of completeness.
The wall of text is very helpful :)
>
>
> Some hardware details
> =====================
> The way every proposed piece of compressed memory hardware I have seen
> would operate is essentially by lying about its capacity to the
> operating system - and then providing mechanisms to determine when the
> compression ratio is dropping to dangerous levels.
>
> Hardware Says : 8GB
> Hardware Has : 1GB
> Node Capacity : 8GB
>
> The capacity numbers are static. Even with hotplug, they must be
> considered static - because the runtime compression ratio can change.
>
> If the device fails to achieve a 4:1 compression ratio, and real usage
> starts to exceed real capacity - the system will fail.
> (dropped writes, poisons, machine checks, etc).
>
> We can mitigate this with strong write-controls and querying the device
> for compression ratio data prior to actually migrating a page.

I am a little bit confused about this. Why do we only need to query the
device before migrating the page?

Are we checking if the device has enough memory for the worst case
scenario (i.e. PAGE_SIZE)?

Or are we checking whether the device can compress and store this
specific page? This seems like it could be racy and there might be some
throwaway work.

I guess my question is: why not just give the page to the device and get
either: successfully compressed and stored OR failed?

Another question, can the device or driver be configured such that we
reject pages that compress poorly to avoid wasting memory and BW on the
device for little savings?
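
To make the two options concrete, here is a minimal userspace sketch of
"query for the worst case first" versus "hand the page over and get
success or failure". Everything in it (struct cram_dev,
cram_can_accept_worst_case(), cram_try_store()) is a made-up model for
illustration, not something from the patch or from any real driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a compressed-memory device (not a real API). */
struct cram_dev {
	uint64_t advertised_bytes;	/* capacity reported to the OS */
	uint64_t physical_bytes;	/* real media behind the device */
	uint64_t physical_used;		/* real media currently consumed */
};

/*
 * Option 1: query before migrating. Assume the page does not compress
 * at all, so it needs a full page_size of real media.
 */
bool cram_can_accept_worst_case(const struct cram_dev *dev, uint64_t page_size)
{
	assert(dev->physical_used <= dev->physical_bytes);
	return dev->physical_bytes - dev->physical_used >= page_size;
}

/*
 * Option 2: hand the page over and let the "device" accept or reject
 * it in one step, so there is no window between the check and the store.
 */
bool cram_try_store(struct cram_dev *dev, uint64_t compressed_len)
{
	assert(dev->physical_used <= dev->physical_bytes);
	if (dev->physical_bytes - dev->physical_used < compressed_len)
		return false;
	dev->physical_used += compressed_len;
	return true;
}
```

In this model the query in option 1 can go stale as soon as another
store lands, which is the race I am worried about; option 2 folds the
check into the store itself.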
>
> Why Zswap to start
> ==================
> ZSwap is an existing, clean read and write control path.
> - We fault on all accesses.
> - It otherwise uses system memory under the hood (kmalloc)
>
> I decided to use zswap as a proving ground for the concept. While the
> design in this patch is simplistic (and as you suggest below, can
> clearly be improved), it demonstrates the entire concept:
>
> on demotion:
> - allocate a page from private memory
> - ask the driver if it's safe to use
> - if safe -> migrate
>   if unsafe -> fallback
>
> on memory access:
> - "promote" to a real page
> - inform the driver the page has been released (zero or discard)
>
> As you point out, the real value in byte-accessible memory is leaving
> the memory mapped. The only difference between cram.c and zswap.c in
> the above pattern would be:
>
> on demotion:
> - allocate a page from private memory
> - ask the driver if it's safe to use
> - if safe -> migrate and remap the page as RO in page tables
>   if unsafe
>     -> trigger reclaim on cram node
>     -> fallback to another demotion
>
> on *write* access:
> - promote to real page
> - clean up the compressed page
This makes sense. I am assuming the main benefit of zswap.c over cram.c
in this scenario is limiting read accesses as well.
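
For my own understanding, the zswap-style demotion steps quoted above
could be modelled roughly like this in plain userspace C. The names
(struct cram_node, cram_demote_page(), cram_budget_is_safe()) are
hypothetical, and "migrate" is reduced to a counter bump, so treat it
as a sketch of the control flow only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demote_result { DEMOTE_MIGRATED, DEMOTE_FALLBACK };

/* Hypothetical driver hook: can the device take one more page? */
typedef bool (*cram_safe_fn)(void *drv);

struct cram_node {
	cram_safe_fn is_safe;	/* driver callback */
	void *drv;		/* driver private data */
	size_t free_pages;	/* private pages still available */
	size_t stored_pages;	/* pages currently demoted to the node */
};

/* Example driver policy: safe while a real-capacity budget remains. */
bool cram_budget_is_safe(void *drv)
{
	size_t *budget = drv;

	if (*budget == 0)
		return false;
	(*budget)--;
	return true;
}

enum demote_result cram_demote_page(struct cram_node *node)
{
	assert(node && node->is_safe);

	/* Step 1: allocate a page from private memory. */
	if (node->free_pages == 0)
		return DEMOTE_FALLBACK;
	node->free_pages--;

	/* Step 2: ask the driver if it is safe to use. */
	if (!node->is_safe(node->drv)) {
		node->free_pages++;	/* give the page back */
		return DEMOTE_FALLBACK;
	}

	/* Step 3: safe, so migrate (modelled as a counter bump). */
	node->stored_pages++;
	return DEMOTE_MIGRATED;
}
```

The cram.c variant would differ only in what happens after step 3
(remap read-only instead of unmapping) and in the unsafe branch
(reclaim on the cram node before falling back).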
[..]
> > So the CXL code tells zswap what nodes are usable, then zswap tries
> > getting a page from these nodes and checking them using APIs provided by
> > the CXL code.
> >
> > Wouldn't it be a better abstraction if the nodemask lived in the CXL
> > code and an API was exposed to zswap just to allocate a page to copy to?
> > Or we can abstract the copy as well and provide an API that directly
> > tries to copy the page to the compressible node.
> >
> > IOW move zswap_compress_direct() (probably under a different name?) and
> > zswap_direct_nodes into CXL code since it's not really zswap logic.
> >
> > Also, I am not sure if the zswap_compress_direct() call and check would
> > introduce any latency, since almost all existing callers will pay for it
> > without benefiting.
> >
> > If we move the function into CXL code, we could probably have an inline
> > wrapper in a header with a static key guarding it to make sure there is
> > no overhead for existing users.
> >
>
>
> CXL is also the wrong place to put it - cxl is just one potential
> source of such a node. We'd want that abstracted...
>
> So this looks like a good use of memory-tiers.c - do dispatch there and
> have it set static branches for various features on node registration.
>
> struct page* mt_migrate_page_to(NODE_TYPE, src, &size);
> -> on success return dst page and the size of the page on hardware
> (target_size would address your accounting notes below)
>
> Then have the migrate function in mt do all the node_private callbacks.
>
> So that would limit the zswap internal change to
>
> if (zswap_node_check()) { /* static branch check */
> 	cpage = mt_migrate_page_to(NODE_PRIVATE_ZSWAP, src, &size);
> 	if (cpage) {
> 		entry->page_handle = cpage;
> 		entry->length = size;
> 		entry->direct = true;
> 		return true;
> 	}
> }
> /* Fallthrough */
Yeah I didn't necessarily mean CXL code, but whatever layer is
responsible for keeping track of which nodes can be used for what.
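
Roughly what I have in mind for the wrapper, as a userspace sketch. A
plain bool stands in for the static key here (in the kernel it would be
a patched branch rather than a load and test), and all the names below
are placeholders I made up, not proposed API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a static key; flipped when a suitable node registers. */
bool private_zswap_nodes_enabled;
unsigned long slow_path_calls;

/* Hypothetical out-of-line path owned by the node-tracking layer. */
bool mt_try_store_direct_slow(const void *src, size_t *stored_len)
{
	(void)src;
	slow_path_calls++;
	*stored_len = 4096;	/* assumption: device reports a full page */
	return true;
}

/* Header-style wrapper: without private nodes, callers only pay for the test. */
static inline bool mt_try_store_direct(const void *src, size_t *stored_len)
{
	if (!private_zswap_nodes_enabled)
		return false;
	return mt_try_store_direct_slow(src, stored_len);
}

/* Called at node registration time to enable the feature. */
void mt_register_private_zswap_node(void)
{
	private_zswap_nodes_enabled = true;
}
```

With this shape zswap only ever sees the wrapper, and the nodemask and
driver callbacks stay behind it.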
>
> ack. this is all great, thank you.
>
> ... snip ...
> > > entry->length = size
> >
> > I don't think this works. Setting entry->length = PAGE_SIZE will cause a
> > few problems, off the top of my head:
> >
> > 1. An entire page of memory will be charged to the memcg, so swapping
> > out the page won't reduce the memcg usage, which will cause thrashing
> > (reclaim with no progress when hitting the limit).
> >
> > Ideally we'd get the compressed length from HW and record it here to
> > charge it appropriately, but I am not sure how we actually want to
> > charge memory on a compressed node. Do we charge the compressed size as
> > normal memory? Does it need separate charging and a separate limit?
> >
> > There are design discussions to be had before we commit to something.
>
> I have a feeling tracking individual page usage would be way too
> granular / inefficient, but I will consult with some folks on whether
> this can be queried. If so, we can add a way to get that info.
>
> node_private_page_size(page) -> returns device reported page size.
>
> or work it directly into the migrate() call like above
>
> --- assuming there isn't a way and we have to deal with fuzzy math ---
>
> The goal should definitely be to leave the charging statistics the same
> from the perspective of services - i.e. zswap should charge a whole page,
> because according to the OS it just used a whole page.
>
> What this would mean is memcg would have to work with fuzzy data.
> If 1GB is charged and the compression ratio is 4:1, reclaim should
> operate (by way of callback) like it has used 256MB.
>
> I think this is the best you can do without tracking individual pages.

This part needs more thought. Zswap cannot charge a full page because
then from the memcg perspective reclaim is not making any progress.
OTOH, as you mention, from the system perspective we just consumed a
full page, so not charging that would be inconsistent.

This is not a zswap-specific thing though, even with cram.c we have to
figure out how to charge memory on the compressed node to the memcg.
It's perhaps not as much of a problem as with zswap because we are not
dealing with reclaim not making progress.

Maybe the memcg limits need to be "enlightened" about different tiers?
We did have such discussions in the past outside the context of
compressed memory, for memory tiering in general.

Not sure if this is the right place to discuss this, but I see the memcg
folks CC'd so maybe it is :)
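
Just to pin down the "fuzzy" arithmetic from the quoted example (1GB
charged at a 4:1 ratio treated as 256MB), a sketch of the scaling a
callback might do. cram_effective_charge() is a hypothetical name, and
the ratio is assumed to be a node-wide average supplied by the driver,
not a per-page value:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a nominal charge by a node-wide compression ratio expressed as
 * ratio_num:ratio_den (4:1 means 4 bytes charged per byte really used).
 * The division is split in two to avoid overflowing 64 bits.
 */
uint64_t cram_effective_charge(uint64_t charged_bytes,
			       uint32_t ratio_num, uint32_t ratio_den)
{
	assert(ratio_num != 0 && ratio_den != 0);
	return charged_bytes / ratio_num * ratio_den +
	       charged_bytes % ratio_num * ratio_den / ratio_num;
}
```

Since the ratio moves at runtime, the same nominal charge would map to
a different effective value over time, which is part of why I think
this needs more thought.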
>
> >
> > 2. The page will be incorrectly counted in
> > zswap_stored_incompressible_pages.
> >
>
> If we can track individual page size, then we can fix that.
>
> If we can't, then we'd need zswap_stored_direct_pages and to do the
> accounting a bit differently. Probably want direct_pages accounting
> anyway, so i might just add that.
Yeah probably the easiest way to deal with this, assuming we keep
entry->length as PAGE_SIZE.
>
> > Aside from that, zswap_total_pages() will be wrong now, as it gets the
> > pool size from zsmalloc and these pages are not allocated from zsmalloc.
> > This is used when checking the pool limits and is exposed in stats.
> >
>
> This is ignorance of zswap on my part, and yeah good point. Will look
> into this accounting a little more.

This is similar-ish to the memcg charging problem, how do we count the
compressed memory usage toward the global zswap limit? Do we keep this
limit for the top-tier? If not, do we charge full size for pages in
c.zswap or compressed size?

Do we need a separate limit for c.zswap? Probably not if the whole node
is dedicated for zswap usage.
>
> > > + memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
> >
> > Why are we using memcpy_folio() here but copy_mc_highpage() on the
> > compression path? Are they equivalent?
> >
>
> both are in include/linux/highmem.h
>
> I was avoiding page->folio conversions in the compression path because
> I had a struct page already.
>
> tl;dr: I'm still looking for the "right" way to do this. I originally
> had a "HACK:" tag here but it seems I dropped it prematurely.
Not a big deal. An RFC or HACK or whatever tag just usually helps signal
to everyone (and more importantly, to Andrew) that this should not be
merged as-is.
>
> (I also think this code can be pushed into mt_ or callbacks)
Agreed.
>
> > > +	if (entry->direct) {
> > > +		struct page *freepage = (struct page *)entry->handle;
> > > +
> > > +		node_private_freed(freepage);
> > > +		__free_page(freepage);
> > > +	} else
> > > +		zs_free(pool->zs_pool, entry->handle);
> >
> > This code is repeated in zswap_entry_free(), we should probably wrap it
> > in a helper that frees the private page or the zsmalloc entry based on
> > entry->direct.
> >
>
> ack.
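
Something along these lines is what I meant by a helper, written as a
userspace model so the shape is clear. struct zswap_entry_model and the
counters are stand-ins I made up; the comments mark where the calls
from the patch (node_private_freed(), __free_page(), zs_free()) would
go:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for struct zswap_entry. */
struct zswap_entry_model {
	bool direct;	/* stored on a private node, not in zsmalloc */
	void *handle;	/* private page or zsmalloc handle */
};

unsigned int direct_frees;
unsigned int zsmalloc_frees;

/* One place that knows how to release either kind of backing storage. */
void zswap_entry_free_handle(struct zswap_entry_model *entry)
{
	assert(entry);
	if (!entry->handle)
		return;

	if (entry->direct) {
		/* patch: node_private_freed(page); __free_page(page); */
		direct_frees++;
	} else {
		/* patch: zs_free(pool->zs_pool, entry->handle); */
		zsmalloc_frees++;
	}

	free(entry->handle);	/* model only: both kinds are malloc'ed here */
	entry->handle = NULL;
}
```

Both the error path and zswap_entry_free() would then call the helper
instead of open-coding the branch.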
>
> Thank you again for taking a look, this has been enlightening. Good
> takeaways for the rest of the N_PRIVATE design.
Thanks for kicking off the discussion here, an interesting problem to
solve for sure :)
>
> I think we can minimize zswap changes even further given this.
>
> ~Gregory