From: Gregory Price <gourry@gourry.net>
To: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-cxl@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
kernel-team@meta.com, longman@redhat.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, akpm@linux-foundation.org,
vbabka@suse.cz, surenb@google.com, mhocko@suse.com,
jackmanb@google.com, ziy@nvidia.com, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, rientjes@google.com,
shakeel.butt@linux.dev, chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
baohua@kernel.org, chengming.zhou@linux.dev,
roman.gushchin@linux.dev, muchun.song@linux.dev,
osalvador@suse.de, matthew.brost@intel.com,
joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com,
ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
Date: Fri, 9 Jan 2026 16:40:08 -0500
Message-ID: <aWF1uDdP75gOCGLm@gourry-fedora-PF4VCD3F>
In-Reply-To: <i6o5k4xumd5i3ehl6ifk3554sowd2qe7yul7vhaqlh2zo6y7is@z2ky4m432wd6>
On Fri, Jan 09, 2026 at 04:00:00PM +0000, Yosry Ahmed wrote:
> On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
>
> If the memory is byte-addressable, using it as a second tier makes it
> directly accessible without page faults, so the access latency is much
> better than a swapped out page in zswap.
>
> Are there some HW limitations that allow a node to be used as a backend
> for zswap but not a second tier?
>
Coming back around - presumably any compressed node capable of hosting a
proper tier would be compatible with zswap, but you might have hardware
that is sufficiently slow (slower than DRAM, faster than storage) that
using it as a proper tier is less efficient than incurring faults.

The standard I've been using is 500ns+ cacheline fetches, but this is
somewhat arbitrary. Even 500ns might be better than accessing multi-us
storage, but once you add compression you might hit 600ns-1us.
This is beside the point, and apologies for the wall of text below -
feel free to skip this next section; I'm writing out what
hardware-specific details I can share for the sake of completeness.
Some hardware details
=====================
The way every proposed piece of compressed memory hardware I have seen
would operate is essentially by lying about its capacity to the
operating system - and then providing mechanisms to determine when the
compression ratio is dropping to dangerous levels.

Hardware Says : 8GB
Hardware Has  : 1GB
Node Capacity : 8GB

The capacity numbers are static. Even with hotplug, they must be
considered static - because the runtime compression ratio can change.
If the device fails to achieve its advertised compression ratio (8:1
here), and real usage starts to exceed real capacity, the system will
fail (dropped writes, poisons, machine checks, etc).
We can mitigate this with strong write controls and by querying the
device for compression-ratio data prior to actually migrating a page.
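The pre-migration check might look something like this (a minimal
sketch - the cram_* names and the 90% watermark are illustrative
assumptions, not from this series):

    /* hypothetical device-side stats, queried from the driver */
    struct cram_stats {
        u64 advertised_bytes;   /* capacity reported to the OS (8GB) */
        u64 backing_bytes;      /* real media behind it (1GB) */
        u64 used_bytes;         /* post-compression bytes in use */
    };

    /* refuse new pages once real usage nears real capacity */
    static bool cram_safe_to_migrate(const struct cram_stats *s,
                                     u64 incoming_bytes)
    {
        return s->used_bytes + incoming_bytes <
               (s->backing_bytes * 9) / 10;
    }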
Why Zswap to start
==================
Zswap is an existing, clean read and write control path:
- We fault on all accesses.
- It otherwise uses system memory under the hood (kmalloc)
I decided to use zswap as a proving ground for the concept. While the
design in this patch is simplistic (and as you suggest below, can
clearly be improved), it demonstrates the entire concept:
on demotion:
- allocate a page from private memory
- ask the driver if it's safe to use
- if safe -> migrate
if unsafe -> fallback
on memory access:
- "promote" to a real page
- inform the driver the page has been released (zero or discard)
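In rough C, the demotion half of that looks something like this (sketch
only - the allocation helper is a hypothetical stand-in; node_private_*
and copy_mc_highpage() are what the patch actually uses):

    static bool zswap_store_direct(struct page *src, struct zswap_entry *entry)
    {
        /* hypothetical: allocate a page from an N_PRIVATE zswap node */
        struct page *dst = alloc_private_page();

        if (!dst)
            return false;

        /* ask the driver if it's safe to use */
        if (node_private_allocated(dst)) {
            __free_page(dst);
            return false;           /* unsafe -> fall back to zsmalloc */
        }

        /* machine-check-safe copy into the device-backed page */
        if (copy_mc_highpage(dst, src)) {
            node_private_freed(dst);
            __free_page(dst);
            return false;
        }

        entry->handle = (unsigned long)dst;
        entry->direct = true;
        return true;
    }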
As you point out, the real value in byte-addressable memory is leaving
the memory mapped; the only difference between cram.c and zswap.c in
the above pattern would be:
on demotion:
- allocate a page from private memory
- ask the driver if it's safe to use
- if safe -> migrate and remap the page as RO in page tables
if unsafe
-> trigger reclaim on cram node
-> fallback to another demotion
on *write* access:
- promote to real page
- clean up the compressed page
> Or is the idea to make promotions from compressed memory to normal
> memory fault-driven instead of relying on page hotness?
>
> I also think there are some design decisions that need to be made before
> we commit to this, see the comments below for more.
>
100% agreed - I'm absolutely not locked into a design; this just gets
the ball rolling :].
> > /* RCU-protected iteration */
> > static LIST_HEAD(zswap_pools);
> > /* protects zswap_pools list modification */
> > @@ -716,7 +732,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> > static void zswap_entry_free(struct zswap_entry *entry)
> > {
> > zswap_lru_del(&zswap_list_lru, entry);
> > - zs_free(entry->pool->zs_pool, entry->handle);
> > + if (entry->direct) {
> > + struct page *page = (struct page *)entry->handle;
>
> Would it be cleaner to add a union in zswap_entry that has entry->handle
> and entry->page?
>
Absolutely. Ack.
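Something like this (sketch, layered onto the existing struct):

    struct zswap_entry {
        ...
        union {
            unsigned long handle;   /* zsmalloc handle */
            struct page *page;      /* direct compressed-node page */
        };
        bool direct;
        ...
    };

That would drop the (struct page *)entry->handle casts entirely.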
> > + /* Skip nodes we've already tried and failed */
> > + if (node_isset(nid, tried_nodes))
> > + continue;
>
> Why do we need this? Does for_each_node_mask() iterate each node more
> than once?
>
This is just me being stupid; I will clean this up. I think I wrote
this when I was using a _next nodemask variant that can loop around,
and just left it in when I got it working.
> I think we can drop the 'found' label by moving things around, would
> this be simpler?
> for_each_node_mask(..) {
> ...
> ret = node_private_allocated(dst);
> if (!ret)
> break;
>
> __free_page(dst);
> dst = NULL;
> }
>
ack, thank you.
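Filled in, I'd expect it to end up roughly like this (sketch - the gfp
flags are an assumption; zswap_direct_nodes is the nodemask from this
patch):

    for_each_node_mask(nid, zswap_direct_nodes) {
        dst = __alloc_pages_node(nid, GFP_NOWAIT | __GFP_THISNODE, 0);
        if (!dst)
            continue;

        if (!node_private_allocated(dst))
            break;          /* success - dst is usable */

        __free_page(dst);
        dst = NULL;
    }
    if (!dst)
        return false;       /* no direct node could take the page */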
> So the CXL code tells zswap what nodes are usable, then zswap tries
> getting a page from these nodes and checking them using APIs provided by
> the CXL code.
>
> Wouldn't it be a better abstraction if the nodemask lived in the CXL
> code and an API was exposed to zswap just to allocate a page to copy to?
> Or we can abstract the copy as well and provide an API that directly
> tries to copy the page to the compressible node.
>
> IOW move zswap_compress_direct() (probably under a different name?) and
> zswap_direct_nodes into CXL code since it's not really zswap logic.
>
> Also, I am not sure if the zswap_compress_direct() call and check would
> introduce any latency, since almost all existing callers will pay for it
> without benefiting.
>
> If we move the function into CXL code, we could probably have an inline
> wrapper in a header with a static key guarding it to make sure there
> is no overhead for existing users.
>
CXL is also the wrong place to put it - CXL is just one potential
source of such a node. We'd want that abstracted...

So this looks like a good use of memory-tiers.c - do the dispatch there
and have it set static branches for various features on node
registration.
    struct page *mt_migrate_page_to(NODE_TYPE, src, &size);
    -> on success, return the dst page and the size of the page on
       hardware (the returned size would address your accounting notes
       below)
Then have the migrate function in mt do all the node_private callbacks.
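On the mt side, this basically moves the earlier alloc/check/copy
pattern behind a dispatch (rough sketch - mt_node_has_type() and
mt_try_migrate_to_node() are hypothetical names, node_private_page_size
is the query discussed in the accounting section below):

    struct page *mt_migrate_page_to(int type, struct page *src, size_t *size)
    {
        struct page *dst;
        int nid;

        for_each_node_state(nid, N_PRIVATE) {
            if (!mt_node_has_type(nid, type))   /* hypothetical */
                continue;

            /* alloc, node_private check, mc-safe copy - as above */
            dst = mt_try_migrate_to_node(nid, src);
            if (dst) {
                *size = node_private_page_size(dst);
                return dst;
            }
        }
        return NULL;
    }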
So that would limit the zswap internal change to:

    if (zswap_node_check()) { /* static branch check */
        cpage = mt_migrate_page_to(NODE_PRIVATE_ZSWAP, src, &size);
        if (cpage) {
            entry->page_handle = cpage;
            entry->length = size;
            entry->direct = true;
            return true;
        }
    }
    /* Fallthrough */
ack. this is all great, thank you.
... snip ...
> > entry->length = size
>
> I don't think this works. Setting entry->length = PAGE_SIZE will cause a
> few problems, off the top of my head:
>
> 1. An entire page of memory will be charged to the memcg, so swapping
> out the page won't reduce the memcg usage, which will cause thrashing
> (reclaim with no progress when hitting the limit).
>
> Ideally we'd get the compressed length from HW and record it here to
> charge it appropriately, but I am not sure how we actually want to
> charge memory on a compressed node. Do we charge the compressed size as
> normal memory? Does it need separate charging and a separate limit?
>
> There are design discussions to be had before we commit to something.
I have a feeling tracking individual page usage would be way too
granular / inefficient, but I will consult with some folks on whether
this can be queried. If so, we can add a way to get that info:

    node_private_page_size(page) -> returns device-reported page size

Or work it directly into the migrate() call like above.
--- assuming there isn't a way and we have to deal with fuzzy math ---
The goal should definitely be to leave the charging statistics the
same from the perspective of services - i.e. zswap should charge a
whole page, because according to the OS it just used a whole page.

What this would mean is that memcg would have to work with fuzzy data.
If 1GB is charged and the compression ratio is 4:1, reclaim should
operate (by way of callback) like it has used 256MB.

I think this is the best you can do without tracking individual pages.
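The fuzzy version is just arithmetic on a device-reported ratio, e.g.
(sketch - the fixed-point percentage representation is an assumption):

    /* ratio expressed as a percentage: 4:1 -> 400 */
    static u64 cram_effective_bytes(u64 charged_bytes, unsigned int ratio_pct)
    {
        return div_u64(charged_bytes * 100, ratio_pct);
    }

1GB charged at 4:1 (ratio_pct = 400) comes out to 256MB effective.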
>
> 2. The page will be incorrectly counted in
> zswap_stored_incompressible_pages.
>
If we can track individual page size, then we can fix that.
If we can't, then we'd need zswap_stored_direct_pages and to do the
accounting a bit differently. We probably want direct_pages accounting
anyway, so I might just add that.
> Aside from that, zswap_total_pages() will be wrong now, as it gets the
> pool size from zsmalloc and these pages are not allocated from zsmalloc.
> This is used when checking the pool limits and is exposed in stats.
>
This is ignorance of zswap on my part - yeah, good point. Will look
into this accounting a little more.
> > + memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
>
> Why are we using memcpy_folio() here but copy_mc_highpage() on the
> compression path? Are they equivalent?
>
Both are in include/linux/highmem.h, though they aren't equivalent -
copy_mc_highpage() is the machine-check-recoverable variant. I was
avoiding page->folio conversions in the compression path because I
already had a struct page.

tl;dr: I'm still looking for the "right" way to do this. I originally
had a "HACK:" tag here but it seems I dropped it prematurely.
(I also think this code can be pushed into mt_ or callbacks)
> > + if (entry->direct) {
> > + struct page *freepage = (struct page *)entry->handle;
> > +
> > + node_private_freed(freepage);
> > + __free_page(freepage);
> > + } else
> > + zs_free(pool->zs_pool, entry->handle);
>
> This code is repeated in zswap_entry_free(), we should probably wrap it
> in a helper that frees the private page or the zsmalloc entry based on
> entry->direct.
>
ack.
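i.e. something like (sketch, lifted straight from the two call sites):

    static void zswap_entry_free_backing(struct zswap_entry *entry)
    {
        if (entry->direct) {
            struct page *page = (struct page *)entry->handle;

            node_private_freed(page);
            __free_page(page);
        } else {
            zs_free(entry->pool->zs_pool, entry->handle);
        }
    }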
Thank you again for taking a look, this has been enlightening. Good
takeaways for the rest of the N_PRIVATE design.
I think we can minimize zswap changes even further given this.
~Gregory