From: Gregory Price <gourry@gourry.net>
To: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-cxl@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
kernel-team@meta.com, longman@redhat.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, akpm@linux-foundation.org,
vbabka@suse.cz, surenb@google.com, mhocko@suse.com,
jackmanb@google.com, ziy@nvidia.com, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, rientjes@google.com,
shakeel.butt@linux.dev, chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
baohua@kernel.org, chengming.zhou@linux.dev,
roman.gushchin@linux.dev, muchun.song@linux.dev,
osalvador@suse.de, matthew.brost@intel.com,
joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com,
ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
Date: Mon, 12 Jan 2026 18:33:16 -0500
Message-ID: <aWWEvAaUmpA_0ERP@gourry-fedora-PF4VCD3F>
In-Reply-To: <4ftthovin57fi4blr2mardw4elwfsiv6vrkhrjqjsfvvuuugjj@uivjc5uzj5ys>
On Mon, Jan 12, 2026 at 09:13:26PM +0000, Yosry Ahmed wrote:
> On Fri, Jan 09, 2026 at 04:40:08PM -0500, Gregory Price wrote:
> > On Fri, Jan 09, 2026 at 04:00:00PM +0000, Yosry Ahmed wrote:
> > > On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
> >
> > Hardware Says : 8GB
> > Hardware Has : 1GB
> > Node Capacity : 8GB
> >
> > The capacity numbers are static. Even with hotplug, they must be
> > considered static - because the runtime compression ratio can change.
> >
> > If the device fails to achieve a 4:1 compression ratio, and real usage
> > starts to exceed real capacity - the system will fail.
> > (dropped writes, poisons, machine checks, etc).
> >
> > We can mitigate this with strong write-controls and querying the device
> > for compression ratio data prior to actually migrating a page.
>
> I am a little bit confused about this. Why do we only need to query the
> device before migrating the page?
>
Because there is no other interposition point at which we could query it.
Everything is memory semantic - it reduces to memcpy().
The actual question you're asking is "What happens if we write the page
and we're out of memory?"
The answer is: The page gets poisoned and the write gets dropped.
That's it. The writer does not get notified. The next reader of that
memory will hit POISON and the failure process will happen (MCE or
SIGBUS, essentially).
> Are we checking if the device has enough memory for the worst case
> scenario (i.e. PAGE_SIZE)?
>
> Or are we checking if the device can compress this specific page and
> store it? This seems like it could be racy and there might be some
> throwaway work.
>
We essentially need to capture the current compression ratio and
real usage to determine whether there's another page available.
It is definitely racy, and the best we can do is set reasonable
real-memory-usage limits to prevent ever finding ourselves in that
scenario. That most likely means requiring the hardware to send an
interrupt when usage and/or ratio hit some threshold and setting a
"NO ALLOCATION ALLOWED" bit.
In software we can also try to query/track this, but we may not be
able to query the device at allocation time (or at least not without
a severe performance penalty).
So yeah, it's racy.
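Roughly the shape of the check I have in mind - pure sketch, the
struct and the helpers (cram_read_real_usage() etc.) are hypothetical:

	/* Worst-case assumption: the next page compresses not at all,
	 * so it may cost a full PAGE_SIZE of real media.
	 */
	static bool cram_page_available(struct cram_device *dev)
	{
		u64 used = cram_read_real_usage(dev); /* real bytes consumed */
		u64 cap  = dev->real_capacity;        /* e.g. 1GB behind 8GB */

		return used + PAGE_SIZE <= cap - dev->low_watermark;
	}

The usage value is stale the moment we read it, hence the watermark
padding rather than an exact comparison.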
> I guess my question is: why not just give the page to the device and get
> either: successfully compressed and stored OR failed?
>
Yeah this is what I meant by this whole thing being sunk into the
callback. I think that's reasonable.
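Something like this in the backend store path, that is (again, all
names made up):

	/* Sketch: let the device accept or reject the page outright. */
	static int cram_store_page(struct cram_device *dev, struct page *page)
	{
		if (!cram_page_available(dev))
			return -ENOMEM; /* caller falls back to real swap */

		/* "Compression" is just a store; the device does the rest. */
		memcpy(cram_slot_address(dev), page_address(page), PAGE_SIZE);
		return 0;
	}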
> Another question, can the device or driver be configured such that we
> reject pages that compress poorly to avoid wasting memory and BW on the
> device for little savings?
>
Memory semantics :]
memcpy(dst, src, PAGE_SIZE) -> no indication of compression ratio
> > on *write* access:
> > - promote to real page
> > - clean up the compressed page
>
> This makes sense. I am assuming the main benefit of zswap.c over cram.c
> in this scenario is limiting read accesses as well.
>
For the first go, yeah. A cram.c would need special page table handling
bits that will take a while to get right. We can make use of the
hardware differently in the meantime.
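Very roughly, I'd expect the promote-on-write path to look like a
write-protect fault handler along these lines (pure sketch - the PTE
surgery is the part I'm hand-waving, and cram_release_slot() is
hypothetical):

	static vm_fault_t cram_promote_on_write(struct vm_fault *vmf)
	{
		struct page *new = alloc_page(GFP_HIGHUSER_MOVABLE);

		if (!new)
			return VM_FAULT_OOM;

		/* Reading the compressed mapping decompresses in hardware. */
		copy_highpage(new, vmf->page);
		/* ... replace the PTE with a writable mapping of @new ... */
		cram_release_slot(vmf->page); /* free the compressed copy */
		return VM_FAULT_NOPAGE;
	}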
> > --- assuming there isn't a way and we have to deal with fuzzy math ---
> >
> > The goal should definitely be to leave the charging statistics the same
> > from the perspective of services - i.e zswap should charge a whole page,
> > because according to the OS it just used a whole page.
> >
> > What this would mean is memcg would have to work with fuzzy data.
> > If 1GB is charged and the compression ratio is 4:1, reclaim should
> > operate (by way of callback) like it has used 256MB.
> >
> > I think this is the best you can do without tracking individual pages.
>
> This part needs more thought. Zswap cannot charge a full page because
> then from the memcg perspective reclaim is not making any progress.
> OTOH, as you mention, from the system perspective we just consumed a
> full page, so not charging that would be inconsistent.
>
> This is not a zswap-specific thing though, even with cram.c we have to
> figure out how to charge memory on the compressed node to the memcg.
> It's perhaps not as much of a problem as with zswap because we are not
> dealing with reclaim not making progress.
>
> Maybe the memcg limits need to be "enlightened" about different tiers?
> We did have such discussions in the past outside the context of
> compressed memory, for memory tiering in general.
>
> Not sure if this is the right place to discuss this, but I see the memcg
> folks CC'd so maybe it is :)
>
I will probably need some help to get the accounting right, if I'm
being honest. I can't say I fully understand the implications here,
but what you describe makes sense.
One of the assumptions you have in zswap is that there's some known
REAL chunk of memory X-GB, and the compression ratio dictates that you
get to cram more than X-GB of data in there.
This device flips that on its head. It lies to the system and says
there's X-GB, and you can only actually use a fraction of it in the
worst case - and in the best case you use all of it.
So in that sense, zswap has "infinite upside" (if you're infinitely
compressible), whereas this device has "limited upside" (node capacity).
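To put numbers on it: 1GB of real media advertised as an 8GB node
needs a sustained 8:1 ratio to fill every advertised page slot; at
2:1 the media is full after only 2GB of the 8GB the node claims, and
any store past that point is the dropped-write/poison scenario above.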
That changes how you account for things entirely, and that's why
entry->length always has to be PAGE_SIZE. Even if the device can tell
us the real size, I'm not sure how useful that is - you still have to
charge for an entire `struct page`.
Time for a good long :think:
> >
> > This is ignorance of zswap on my part, and yeah good point. Will look
> > into this accounting a little more.
>
> This is similar-ish to the memcg charging problem, how do we count the
> compressed memory usage toward the global zswap limit? Do we keep this
> limit for the top-tier? If not, do we charge full size for pages in
> c.zswap or compressed size?
>
> Do we need a separate limit for c.zswap? Probably not if the whole node
> is dedicated for zswap usage.
>
Since we're accounting for entire `struct page` usage against the hard
cap of (device_capacity / PAGE_SIZE), this might actually be the answer.
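i.e. with every entry charged at PAGE_SIZE, the limit check reduces
to a straight page count against what the node advertises (sketch,
fields hypothetical):

	static bool cram_zswap_full(struct cram_device *dev)
	{
		unsigned long max = dev->advertised_bytes / PAGE_SIZE;

		return atomic_long_read(&dev->nr_stored_pages) >= max;
	}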
> >
> > Thank you again for taking a look, this has been enlightening. Good
> > takeaways for the rest of the N_PRIVATE design.
>
> Thanks for kicking off the discussion here, an interesting problem to
> solve for sure :)
>
One of the more interesting ones I've had in a few years :]
Cheers,
~Gregory