Date: Fri, 9 Jan 2026 16:00:00 +0000
From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, kernel-team@meta.com, longman@redhat.com,
    tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
    gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
    dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com,
    alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com,
    dan.j.williams@intel.com, akpm@linux-foundation.org, vbabka@suse.cz,
    surenb@google.com, mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
    david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
    rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
    weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
    rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
    kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
    bhe@redhat.com, baohua@kernel.org, chengming.zhou@linux.dev,
    roman.gushchin@linux.dev, muchun.song@linux.dev, osalvador@suse.de,
    matthew.brost@intel.com, joshua.hahnjy@gmail.com, rakie.kim@sk.com,
    byungchul@sk.com, ying.huang@linux.alibaba.com, apopple@nvidia.com,
    cl@gentwo.org, harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: Re: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
References: <20260108203755.1163107-1-gourry@gourry.net>
 <20260108203755.1163107-8-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-8-gourry@gourry.net>

On Thu, Jan 08, 2026 at 03:37:54PM -0500, Gregory Price wrote:
> If a private zswap-node is available, skip the entire software
> compression process and memcpy directly to a compressed memory
> folio, and store the newly allocated compressed memory page as
> the zswap entry->handle.
>
> On decompress we do the opposite: copy directly from the stored
> page to the destination, and free the compressed memory page.
>
> The driver callback is responsible for preventing run-away
> compression ratio failures by checking that the allocated page is
> safe to use (i.e. a compression ratio limit hasn't been crossed).
>
> Signed-off-by: Gregory Price

Hi Gregory,

Thanks for sending this. I have a lot of questions/comments below, but
at a high level I am trying to understand the benefit of using a
compressed node for zswap rather than as a second tier. If the memory
is byte-addressable, using it as a second tier makes it directly
accessible without page faults, so the access latency is much better
than that of a swapped-out page in zswap.

Are there some HW limitations that allow a node to be used as a backend
for zswap but not as a second tier? Or is the idea to make promotions
from compressed memory to normal memory fault-driven instead of relying
on page hotness?

I also think there are some design decisions that need to be made
before we commit to this, see the comments below for more.

> ---
>  include/linux/zswap.h |   5 ++
>  mm/zswap.c            | 106 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 109 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..4b52fe447e7e 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +void zswap_add_direct_node(int nid);
> +void zswap_remove_direct_node(int nid);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -69,6 +71,9 @@ static inline bool zswap_never_enabled(void)
>  	return true;
>  }
>
> +static inline void zswap_add_direct_node(int nid) {}
> +static inline void zswap_remove_direct_node(int nid) {}
> +
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/zswap.c b/mm/zswap.c
> index de8858ff1521..aada588c957e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "swap.h"
>  #include "internal.h"
> @@ -190,6 +191,7 @@ struct zswap_entry {
>  	swp_entry_t swpentry;
>  	unsigned int length;
>  	bool referenced;
> +	bool direct;
>  	struct zswap_pool *pool;
>  	unsigned long handle;
>  	struct obj_cgroup *objcg;
> @@ -199,6 +201,20 @@ struct zswap_entry {
>  static struct xarray *zswap_trees[MAX_SWAPFILES];
>  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
>
> +/* Nodemask for compressed RAM nodes used by zswap_compress_direct */
> +static nodemask_t zswap_direct_nodes = NODE_MASK_NONE;
> +
> +void zswap_add_direct_node(int nid)
> +{
> +	node_set(nid, zswap_direct_nodes);
> +}
> +
> +void zswap_remove_direct_node(int nid)
> +{
> +	if (!node_online(nid))
> +		node_clear(nid, zswap_direct_nodes);
> +}
> +
>  /* RCU-protected iteration */
>  static LIST_HEAD(zswap_pools);
>  /* protects zswap_pools list modification */
> @@ -716,7 +732,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>  static void zswap_entry_free(struct zswap_entry *entry)
>  {
>  	zswap_lru_del(&zswap_list_lru, entry);
> -	zs_free(entry->pool->zs_pool, entry->handle);
> +	if (entry->direct) {
> +		struct page *page = (struct page *)entry->handle;

Would it be cleaner to add a union in zswap_entry that has
entry->handle and entry->page?
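E.g. something like (just a sketch, replacing the current handle
member):

	union {
		unsigned long handle;	/* zsmalloc handle */
		struct page *page;	/* compressed RAM page */
	};

Then the direct paths can use entry->page instead of casting
entry->handle back and forth.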
> +
> +		node_private_freed(page);
> +		__free_page(page);
> +	} else
> +		zs_free(entry->pool->zs_pool, entry->handle);
>  	zswap_pool_put(entry->pool);
>  	if (entry->objcg) {
>  		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> @@ -849,6 +871,58 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
>  	mutex_unlock(&acomp_ctx->mutex);
>  }
>
> +static struct page *zswap_compress_direct(struct page *src,
> +					  struct zswap_entry *entry)
> +{
> +	int nid;
> +	struct page *dst;
> +	gfp_t gfp;
> +	nodemask_t tried_nodes = NODE_MASK_NONE;
> +
> +	if (nodes_empty(zswap_direct_nodes))
> +		return NULL;
> +
> +	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE |
> +	      __GFP_THISNODE;
> +
> +	for_each_node_mask(nid, zswap_direct_nodes) {
> +		int ret;
> +
> +		/* Skip nodes we've already tried and failed */
> +		if (node_isset(nid, tried_nodes))
> +			continue;

Why do we need this? Does for_each_node_mask() iterate each node more
than once?

> +
> +		dst = __alloc_pages(gfp, 0, nid, &zswap_direct_nodes);
> +		if (!dst)
> +			continue;
> +
> +		/*
> +		 * Check with the device driver that this page is safe to use.
> +		 * If the device reports an error (e.g., compression ratio is
> +		 * too low and the page can't safely store data), free the page
> +		 * and try another node.
> +		 */
> +		ret = node_private_allocated(dst);
> +		if (ret) {
> +			__free_page(dst);
> +			node_set(nid, tried_nodes);
> +			continue;
> +		}

I think we can drop the 'found' label by moving things around; would
this be simpler?

	for_each_node_mask(..) {
		...
		ret = node_private_allocated(dst);
		if (!ret)
			break;
		__free_page(dst);
		dst = NULL;
	}

	if (!dst)
		return NULL;

	if (copy_mc_highpage(..)) {
		..
	}
	return dst;

> +
> +		goto found;
> +	}
> +
> +	return NULL;
> +
> +found:
> +	/* If we fail to copy at this point just fallback */
> +	if (copy_mc_highpage(dst, src)) {
> +		__free_page(dst);
> +		dst = NULL;
> +	}
> +	return dst;
> +}
> +

So the CXL code tells zswap what nodes are usable, then zswap tries
getting a page from these nodes and checking them using APIs provided
by the CXL code. Wouldn't it be a better abstraction if the nodemask
lived in the CXL code and an API was exposed to zswap just to allocate
a page to copy to? Or we can abstract the copy as well and provide an
API that directly tries to copy the page to the compressible node.

IOW, move zswap_compress_direct() (probably under a different name?)
and zswap_direct_nodes into CXL code, since it's not really zswap
logic.

Also, I am not sure if the zswap_compress_direct() call and check would
introduce any latency, since almost all existing callers will pay for
it without benefiting. If we move the function into CXL code, we could
probably have an inline wrapper in a header with a static key guarding
it to make sure there is no overhead for existing users.
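Roughly like this (only a sketch; all the names here are made up, and
__cram_store_page() would do the node allocation, driver check, and
copy that zswap_compress_direct() does today):

	/* In a shared header */
	DECLARE_STATIC_KEY_FALSE(cram_store_key);

	struct page *__cram_store_page(struct page *src);

	static inline struct page *cram_store_page(struct page *src)
	{
		/* Patched to a no-op until a direct node is registered */
		if (!static_branch_unlikely(&cram_store_key))
			return NULL;
		return __cram_store_page(src);
	}

	/* In the CXL code, where the nodemask would also move */
	DEFINE_STATIC_KEY_FALSE(cram_store_key);

	void cram_add_node(int nid)
	{
		node_set(nid, cram_direct_nodes);
		static_branch_enable(&cram_store_key);
	}

Then zswap_compress() just calls cram_store_page(), and existing users
only pay for a patched-out branch.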
>  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  			   struct zswap_pool *pool)
>  {
> @@ -860,6 +934,17 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	gfp_t gfp;
>  	u8 *dst;
>  	bool mapped = false;
> +	struct page *zpage;
> +
> +	/* Try to shunt directly to compressed ram */
> +	zpage = zswap_compress_direct(page, entry);
> +	if (zpage) {
> +		entry->handle = (unsigned long)zpage;
> +		entry->length = PAGE_SIZE;
> +		entry->direct = true;
> +		return true;
> +	}

I don't think this works. Setting entry->length = PAGE_SIZE will cause
a few problems, off the top of my head:

1. An entire page of memory will be charged to the memcg, so swapping
   out the page won't reduce the memcg usage, which will cause
   thrashing (reclaim with no progress when hitting the limit).

   Ideally we'd get the compressed length from HW and record it here to
   charge it appropriately, but I am not sure how we actually want to
   charge memory on a compressed node. Do we charge the compressed size
   as normal memory? Does it need separate charging and a separate
   limit? There are design discussions to be had before we commit to
   something.

2. The page will be incorrectly counted in
   zswap_stored_incompressible_pages.

Aside from that, zswap_total_pages() will be wrong now, as it gets the
pool size from zsmalloc and these pages are not allocated from
zsmalloc. This is used when checking the pool limits and is exposed in
stats.

> +	/* otherwise fallback to normal zswap */
>
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
>  	dst = acomp_ctx->buffer;
> @@ -913,6 +998,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	zs_obj_write(pool->zs_pool, handle, dst, dlen);
>  	entry->handle = handle;
>  	entry->length = dlen;
> +	entry->direct = false;
>
>  unlock:
>  	if (mapped)
> @@ -936,6 +1022,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	int decomp_ret = 0, dlen = PAGE_SIZE;
>  	u8 *src, *obj;
>
> +	/* compressed ram page */
> +	if (entry->direct) {
> +		struct page *src = (struct page *)entry->handle;
> +		struct folio *zfolio = page_folio(src);
> +
> +		memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);

Why are we using memcpy_folio() here but copy_mc_highpage() on the
compression path? Are they equivalent?

> +		goto direct_done;
> +	}
> +
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
>  	obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
>
> @@ -969,6 +1064,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	zs_obj_read_end(pool->zs_pool, entry->handle, obj);
>  	acomp_ctx_put_unlock(acomp_ctx);
>
> +direct_done:
>  	if (!decomp_ret && dlen == PAGE_SIZE)
>  		return true;
>
> @@ -1483,7 +1579,13 @@ static bool zswap_store_page(struct page *page,
>  	return true;
>
>  store_failed:
> -	zs_free(pool->zs_pool, entry->handle);
> +	if (entry->direct) {
> +		struct page *freepage = (struct page *)entry->handle;
> +
> +		node_private_freed(freepage);
> +		__free_page(freepage);
> +	} else
> +		zs_free(pool->zs_pool, entry->handle);

This code is repeated in zswap_entry_free(), we should probably wrap it
in a helper that frees the private page or the zsmalloc entry based on
entry->direct (see the sketch after the diff below).

>  compress_failed:
>  	zswap_entry_cache_free(entry);
>  	return false;
> --
> 2.52.0
>
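For the helper mentioned above, something like this perhaps (a sketch;
the name and the pool argument are made up, the body is just the two
copies from the diff deduplicated):

	static void zswap_entry_free_data(struct zswap_pool *pool,
					  struct zswap_entry *entry)
	{
		if (entry->direct) {
			struct page *page = (struct page *)entry->handle;

			node_private_freed(page);
			__free_page(page);
		} else {
			zs_free(pool->zs_pool, entry->handle);
		}
	}

zswap_entry_free() would pass entry->pool, and zswap_store_page() the
local pool.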