* [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Nhat Pham @ 2025-03-29 11:02 UTC
To: linux-mm
Cc: akpm, hannes, yosry.ahmed, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
Currently, systems with CXL-based memory tiering can encounter the
following inversion with zswap: the coldest pages demoted to the CXL
tier can return to the high tier when they are zswapped out,
creating memory pressure on the high tier.
This happens because zsmalloc, zswap's backend memory allocator, does
not enforce any memory policy. If the task reclaiming memory follows
the local-first policy, for example, the memory requested for zswap can
be served by the upper tier, leading to the aforementioned inversion.
This RFC fixes this inversion by adding a new memory allocation mode
for zswap (exposed through a zswap sysfs knob), intended for
hosts with CXL, where the memory for the compressed object is requested
preferentially from the same node that the original page resides on.
With the new zswap allocation mode enabled, we should observe the
following dynamics:
1. When demotion is turned on, under reasonable conditions, zswap will
prefer CXL memory by default, since top-tier memory being reclaimed
will typically be demoted instead of swapped.
2. This should prevent reclaim on the lower tier from causing high-tier
memory pressure due to new allocations.
3. This should avoid a quiet promotion of cold memory (memory being
zswapped is cold, but is promoted when put into the zswap pool
because the memory allocated for the compressed copy comes from the
high tier).
4. However, this may put pressure on the CXL tier itself, which may
   result in further demotion (to swap, etc.). This needs to be
   tested.
I'm still testing and collecting more data, but figured I should send
this out as an RFC to spark discussion:
1. Is this the right policy? Do we need a more complicated policy?
   Should we instead go for the "lowest" node (which would require a new
   memory tiering API)? Or maybe try each node from the current node
   down to the lowest node in the hierarchy?
   Also, I hacked this fix together with CXL in mind, but if there are
   other cases I should address, we can explore a more general memory
   allocation strategy or interface.
2. Similarly, is this the right zsmalloc API? For instance, we could
   build a full-fledged mempolicy-based API for zsmalloc, but I haven't
   found a use case for it yet.
3. Assuming this is the right policy, what should the naming and
   semantics be? I'm not very good at naming things, so same_node_mode
   might not be it :)
Nhat Pham (2):
zsmalloc: let callers select NUMA node to store the compressed objects
zswap: add sysfs knob for same node mode
Documentation/admin-guide/mm/zswap.rst | 9 +++++++++
include/linux/zpool.h | 4 ++--
mm/zpool.c | 8 +++++---
mm/zsmalloc.c | 28 +++++++++++++++++++-------
mm/zswap.c | 10 +++++++--
5 files changed, 45 insertions(+), 14 deletions(-)
base-commit: 4135040c342ba080328891f1b7e523c8f2f04c58
--
2.47.1
* [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects
From: Nhat Pham @ 2025-03-29 11:02 UTC
To: linux-mm
Cc: akpm, hannes, yosry.ahmed, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
Currently, zsmalloc does not specify any memory policy when it allocates
memory for the compressed objects.
Let users select the NUMA node for the memory allocation, through the
zpool-based API. Direct callers (i.e., zram) should not observe any
behavioral change.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/zpool.h | 4 ++--
mm/zpool.c | 8 +++++---
mm/zsmalloc.c | 28 +++++++++++++++++++++-------
mm/zswap.c | 2 +-
4 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index 52f30e526607..0df8722e13d7 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
void zpool_destroy_pool(struct zpool *pool);
int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
- unsigned long *handle);
+ unsigned long *handle, int *nid);
void zpool_free(struct zpool *pool, unsigned long handle);
@@ -64,7 +64,7 @@ struct zpool_driver {
void (*destroy)(void *pool);
int (*malloc)(void *pool, size_t size, gfp_t gfp,
- unsigned long *handle);
+ unsigned long *handle, int *nid);
void (*free)(void *pool, unsigned long handle);
void *(*obj_read_begin)(void *pool, unsigned long handle,
diff --git a/mm/zpool.c b/mm/zpool.c
index 6d6d88930932..591a13b99755 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool)
* @size: The amount of memory to allocate.
* @gfp: The GFP flags to use when allocating memory.
* @handle: Pointer to the handle to set
+ * @nid: Pointer to the preferred node id.
*
* This allocates the requested amount of memory from the pool.
* The gfp flags will be used when allocating memory, if the
* implementation supports it. The provided @handle will be
- * set to the allocated object handle.
+ * set to the allocated object handle. If @nid is provided, the
+ * allocation will prefer the specified node.
*
* Implementations must guarantee this to be thread-safe.
*
* Returns: 0 on success, negative value on error.
*/
int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
- unsigned long *handle)
+ unsigned long *handle, int *nid)
{
- return zpool->driver->malloc(zpool->pool, size, gfp, handle);
+ return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
}
/**
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 961b270f023c..35f61f14c32e 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -243,9 +243,14 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
}
-static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
+static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, int *nid)
{
- struct page *page = alloc_page(gfp);
+ struct page *page;
+
+ if (nid)
+ page = alloc_pages_node(*nid, gfp, 0);
+ else
+ page = alloc_page(gfp);
return page_zpdesc(page);
}
@@ -461,10 +466,13 @@ static void zs_zpool_destroy(void *pool)
zs_destroy_pool(pool);
}
+static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
+ gfp_t gfp, int *nid);
+
static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
- unsigned long *handle)
+ unsigned long *handle, int *nid)
{
- *handle = zs_malloc(pool, size, gfp);
+ *handle = zs_malloc_node(pool, size, gfp, nid);
if (IS_ERR_VALUE(*handle))
return PTR_ERR((void *)*handle);
@@ -1044,7 +1052,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
*/
static struct zspage *alloc_zspage(struct zs_pool *pool,
struct size_class *class,
- gfp_t gfp)
+ gfp_t gfp, int *nid)
{
int i;
struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
@@ -1061,7 +1069,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
for (i = 0; i < class->pages_per_zspage; i++) {
struct zpdesc *zpdesc;
- zpdesc = alloc_zpdesc(gfp);
+ zpdesc = alloc_zpdesc(gfp, nid);
if (!zpdesc) {
while (--i >= 0) {
zpdesc_dec_zone_page_state(zpdescs[i]);
@@ -1342,6 +1350,12 @@ static unsigned long obj_malloc(struct zs_pool *pool,
* Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
*/
unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
+{
+ return zs_malloc_node(pool, size, gfp, NULL);
+}
+
+static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
+ gfp_t gfp, int *nid)
{
unsigned long handle;
struct size_class *class;
@@ -1376,7 +1390,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
spin_unlock(&class->lock);
- zspage = alloc_zspage(pool, class, gfp);
+ zspage = alloc_zspage(pool, class, gfp, nid);
if (!zspage) {
cache_free_handle(pool, handle);
return (unsigned long)ERR_PTR(-ENOMEM);
diff --git a/mm/zswap.c b/mm/zswap.c
index 204fb59da33c..89b6d4ade4cd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
zpool = pool->zpool;
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
- alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
+ alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, NULL);
if (alloc_ret)
goto unlock;
--
2.47.1
* [RFC PATCH 2/2] zswap: add sysfs knob for same node mode
From: Nhat Pham @ 2025-03-29 11:02 UTC
To: linux-mm
Cc: akpm, hannes, yosry.ahmed, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
Taking advantage of the new node-selection capability of zsmalloc, allow
zswap to keep the compressed copy in the same node as the original page.
The main use case is CXL systems, where pages in the CXL tier should
stay in CXL when they are zswapped, so as not to create memory pressure
in the higher tier.
This new behavior is opt-in only, and can be enabled as follows:
echo Y > /sys/module/zswap/parameters/same_node_mode
Suggested-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
Documentation/admin-guide/mm/zswap.rst | 9 +++++++++
mm/zswap.c | 10 ++++++++--
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index fd3370aa43fe..be8953acc15e 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -142,6 +142,15 @@ User can enable it as follows::
This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
selected.
+In a NUMA system, sometimes we want the compressed copy to reside in the same
+node as the original page. For instance, if we use the NUMA nodes to represent
+a CXL-based memory tiering system, we do not want the pages demoted to the
+lower tier to accidentally return to the higher tier via zswap, creating
+memory pressure in the higher tier. The same-node behavior can be enabled
+as follows::
+
+ echo Y > /sys/module/zswap/parameters/same_node_mode
+
A debugfs interface is provided for various statistic about pool size, number
of pages stored, same-value filled pages and various counters for the reasons
pages are rejected.
diff --git a/mm/zswap.c b/mm/zswap.c
index 89b6d4ade4cd..2eee57648750 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -129,6 +129,9 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
+static bool zswap_same_node_mode;
+module_param_named(same_node_mode, zswap_same_node_mode, bool, 0644);
+
bool zswap_is_enabled(void)
{
return zswap_enabled;
@@ -942,7 +945,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
{
struct crypto_acomp_ctx *acomp_ctx;
struct scatterlist input, output;
- int comp_ret = 0, alloc_ret = 0;
+ int comp_ret = 0, alloc_ret = 0, nid = page_to_nid(page);
unsigned int dlen = PAGE_SIZE;
unsigned long handle;
struct zpool *zpool;
@@ -981,7 +984,10 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
zpool = pool->zpool;
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
- alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, NULL);
+ if (zswap_same_node_mode)
+ alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, &nid);
+ else
+ alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, NULL);
if (alloc_ret)
goto unlock;
--
2.47.1
* Re: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Yosry Ahmed @ 2025-03-29 19:53 UTC
To: Nhat Pham, linux-mm
Cc: akpm, hannes, chengming.zhou, sj, kernel-team, linux-kernel,
gourry, willy, ying.huang, jonathan.cameron, dan.j.williams,
linux-cxl, minchan, senozhatsky
March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
> Currently, systems with CXL-based memory tiering can encounter the
> following inversion with zswap: the coldest pages demoted to the CXL
> tier can return to the high tier when they are zswapped out,
> creating memory pressure on the high tier.
> This happens because zsmalloc, zswap's backend memory allocator, does
> not enforce any memory policy. If the task reclaiming memory follows
> the local-first policy, for example, the memory requested for zswap can
> be served by the upper tier, leading to the aforementioned inversion.
> This RFC fixes this inversion by adding a new memory allocation mode
> for zswap (exposed through a zswap sysfs knob), intended for
> hosts with CXL, where the memory for the compressed object is requested
> preferentially from the same node that the original page resides on.
I didn't look too closely, but why not just prefer the same node by default? Why is a knob needed?
Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?
* Re: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Nhat Pham @ 2025-03-29 22:13 UTC
To: Yosry Ahmed
Cc: linux-mm, akpm, hannes, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
On Sat, Mar 29, 2025 at 12:53 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
>
> > Currently, systems with CXL-based memory tiering can encounter the
> > following inversion with zswap: the coldest pages demoted to the CXL
> > tier can return to the high tier when they are zswapped out,
> > creating memory pressure on the high tier.
> > This happens because zsmalloc, zswap's backend memory allocator, does
> > not enforce any memory policy. If the task reclaiming memory follows
> > the local-first policy, for example, the memory requested for zswap can
> > be served by the upper tier, leading to the aforementioned inversion.
> > This RFC fixes this inversion by adding a new memory allocation mode
> > for zswap (exposed through a zswap sysfs knob), intended for
> > hosts with CXL, where the memory for the compressed object is requested
> > preferentially from the same node that the original page resides on.
>
> I didn't look too closely, but why not just prefer the same node by default? Why is a knob needed?
Good question - yeah, the knob is to maintain the old behavior :) It
might not be optimal, or even advisable, for all setups.
For hosts with node-based memory tiering, yeah, it's a good idea in
general, but I don't quite know how to obtain that information from
the kernel's perspective.
>
> Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?
Is there an abstraction of the "tier" that we can use here?
* Re: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Nhat Pham @ 2025-03-29 22:17 UTC
To: Yosry Ahmed
Cc: linux-mm, akpm, hannes, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
On Sat, Mar 29, 2025 at 3:13 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Good question - yeah, the knob is to maintain the old behavior :) It
> might not be optimal, or even advisable, for all setups.
>
> For hosts with node-based memory tiering, yeah, it's a good idea in
> general, but I don't quite know how to obtain that information from
> the kernel's perspective.
>
> >
> > Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?
>
> Is there an abstraction of the "tier" that we can use here?
Maaaybe "struct memory_tier" (memory-tiers.c)? Lemme take a look at that...
* Re: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Johannes Weiner @ 2025-03-31 16:53 UTC
To: Yosry Ahmed
Cc: Nhat Pham, linux-mm, akpm, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
On Sat, Mar 29, 2025 at 07:53:23PM +0000, Yosry Ahmed wrote:
> March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
>
> > Currently, systems with CXL-based memory tiering can encounter the
> > following inversion with zswap: the coldest pages demoted to the CXL
> > tier can return to the high tier when they are zswapped out,
> > creating memory pressure on the high tier.
> > This happens because zsmalloc, zswap's backend memory allocator, does
> > not enforce any memory policy. If the task reclaiming memory follows
> > the local-first policy, for example, the memory requested for zswap can
> > be served by the upper tier, leading to the aforementioned inversion.
> > This RFC fixes this inversion by adding a new memory allocation mode
> > for zswap (exposed through a zswap sysfs knob), intended for
> > hosts with CXL, where the memory for the compressed object is requested
> > preferentially from the same node that the original page resides on.
>
> I didn't look too closely, but why not just prefer the same node by
> default? Why is a knob needed?
+1 It should really be the default.
Even on regular NUMA setups this behavior makes more sense. Consider a
direct reclaimer scanning nodes in order of allocation preference. If
it ventures into remote nodes, the memory it compresses there should
stay there. Trying to shift those contents over to the reclaiming
thread's preferred node further *increases* its local pressure, and
provoking more spills. The remote node is also the most likely to
refault this data again. This is just bad for everybody.
> Or maybe if there's a way to tell the "tier" of the node we can
> prefer to allocate from the same "tier"?
Presumably, other nodes in the same tier would come first in the
fallback zonelist of that node, so page_to_nid() should just work.
I wouldn't complicate this until somebody has real systems where it
does the wrong thing.
My vote is to stick with page_to_nid(), but do it unconditionally.
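Concretely, doing it unconditionally would reduce the zswap_compress()
hunk to something like this (a sketch against this series, untested):

	int nid = page_to_nid(page);

	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, &nid);

with no knob in between.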
* Re: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Gregory Price @ 2025-03-31 17:06 UTC
To: Yosry Ahmed
Cc: Nhat Pham, linux-mm, akpm, hannes, chengming.zhou, sj,
kernel-team, linux-kernel, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
On Sat, Mar 29, 2025 at 07:53:23PM +0000, Yosry Ahmed wrote:
> March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
>
> > Currently, systems with CXL-based memory tiering can encounter the
> > following inversion with zswap: the coldest pages demoted to the CXL
> > tier can return to the high tier when they are zswapped out,
> > creating memory pressure on the high tier.
> > This happens because zsmalloc, zswap's backend memory allocator, does
> > not enforce any memory policy. If the task reclaiming memory follows
> > the local-first policy, for example, the memory requested for zswap can
> > be served by the upper tier, leading to the aforementioned inversion.
> > This RFC fixes this inversion by adding a new memory allocation mode
> > for zswap (exposed through a zswap sysfs knob), intended for
> > hosts with CXL, where the memory for the compressed object is requested
> > preferentially from the same node that the original page resides on.
>
> I didn't look too closely, but why not just prefer the same node by default? Why is a knob needed?
>
Bit of an open question: does this hurt zswap performance?
And of course the follow-up question: does that matter?
Probably the answers are "not really" and "no", but it's nice to have
the knob for testing. I imagine we'd drop it along with the RFC tag.
> Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?
In almost every system, tier=node for any sane situation, though nodes
across sockets can end up lumped into the same tier - which maybe
doesn't matter for zswap but isn't useful for almost anything else.
But maybe there's an argument for adding new tier-policies.
:think:
	enum memtier_policy {
		MEMTIER_SAME_TIER,	/* get a different node from same tier */
		MEMTIER_DEMOTE_ONE,	/* demote one step */
		MEMTIER_DEMOTE_FAR,	/* demote one step away from swap */
		MEMTIER_PROMOTE_ONE,	/* promote one step */
		MEMTIER_PROMOTE_LOCAL,	/* promote to local on topology */
	};

	int memtier_get_node(enum memtier_policy policy, int nid);
Might be worth investigating. Just spitballing here.
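For instance, a zswap-like caller that wants a different node in the
same tier as the original page could do something like (hypothetical,
none of this exists today):

	int nid = memtier_get_node(MEMTIER_SAME_TIER, page_to_nid(page));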
The issue is really fallback allocations. In most cases, we know what
we'd like to do, but when the system is under pressure the question is
what behavior we want from these components. I'd hesitate to make a
strong claim about whether zswap should/should not fall back to a
higher-tier node under system pressure without strong data.
~Gregory
* Re: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
From: Nhat Pham @ 2025-03-31 17:32 UTC
To: Johannes Weiner
Cc: Yosry Ahmed, linux-mm, akpm, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
On Mon, Mar 31, 2025 at 9:53 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sat, Mar 29, 2025 at 07:53:23PM +0000, Yosry Ahmed wrote:
> > March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
> >
> > > Currently, systems with CXL-based memory tiering can encounter the
> > > following inversion with zswap: the coldest pages demoted to the CXL
> > > tier can return to the high tier when they are zswapped out,
> > > creating memory pressure on the high tier.
> > > This happens because zsmalloc, zswap's backend memory allocator, does
> > > not enforce any memory policy. If the task reclaiming memory follows
> > > the local-first policy, for example, the memory requested for zswap can
> > > be served by the upper tier, leading to the aforementioned inversion.
> > > This RFC fixes this inversion by adding a new memory allocation mode
> > > for zswap (exposed through a zswap sysfs knob), intended for
> > > hosts with CXL, where the memory for the compressed object is requested
> > > preferentially from the same node that the original page resides on.
> >
> > I didn't look too closely, but why not just prefer the same node by
> > default? Why is a knob needed?
>
> +1 It should really be the default.
>
> Even on regular NUMA setups this behavior makes more sense. Consider a
> direct reclaimer scanning nodes in order of allocation preference. If
> it ventures into remote nodes, the memory it compresses there should
> stay there. Trying to shift those contents over to the reclaiming
> thread's preferred node further *increases* its local pressure,
> provoking more spills. The remote node is also the most likely to
> refault this data again. This is just bad for everybody.
Makes a lot of sense. I'll include this in v2 of the patch series,
and rephrase it as a generic NUMA fix (with CXL as one of the
examples/motivations).
Thanks for the comment, Johannes! I'll remove this knob altogether and
make this the default behavior.
>
> > Or maybe if there's a way to tell the "tier" of the node we can
> > prefer to allocate from the same "tier"?
>
> Presumably, other nodes in the same tier would come first in the
> fallback zonelist of that node, so page_to_nid() should just work.
>
> I wouldn't complicate this until somebody has real systems where it
> does the wrong thing.
>
> My vote is to stick with page_to_nid(), but do it unconditionally.
SGTM.
>
* Re: [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects
From: Dan Williams @ 2025-03-31 22:17 UTC
To: Nhat Pham, linux-mm
Cc: akpm, hannes, yosry.ahmed, chengming.zhou, sj, kernel-team,
linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
dan.j.williams, linux-cxl, minchan, senozhatsky
Nhat Pham wrote:
> Currently, zsmalloc does not specify any memory policy when it allocates
> memory for the compressed objects.
>
> Let users select the NUMA node for the memory allocation, through the
> zpool-based API. Direct callers (i.e., zram) should not observe any
> behavioral change.
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
> include/linux/zpool.h | 4 ++--
> mm/zpool.c | 8 +++++---
> mm/zsmalloc.c | 28 +++++++++++++++++++++-------
> mm/zswap.c | 2 +-
> 4 files changed, 29 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> index 52f30e526607..0df8722e13d7 100644
> --- a/include/linux/zpool.h
> +++ b/include/linux/zpool.h
> @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
> void zpool_destroy_pool(struct zpool *pool);
>
> int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> - unsigned long *handle);
> + unsigned long *handle, int *nid);
I agree with Johannes about the policy knob, so I'll just comment on the
implementation.
Why not just pass a "const int" for @nid, and use "NUMA_NO_NODE" for the
"default" case? alloc_pages_node_noprof() is already prepared for a
NUMA_NO_NODE argument.
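Something like this, i.e. an untested sketch of the by-value variant:

	int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
			 unsigned long *handle, const int nid);

	static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int nid)
	{
		/*
		 * alloc_pages_node() resolves NUMA_NO_NODE to the current
		 * CPU's closest node, so no NULL check is needed here.
		 */
		struct page *page = alloc_pages_node(nid, gfp, 0);

		return page_zpdesc(page);
	}

Callers that don't care would simply pass NUMA_NO_NODE.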
* Re: [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects
From: Nhat Pham @ 2025-03-31 23:03 UTC
To: Dan Williams
Cc: linux-mm, akpm, hannes, yosry.ahmed, chengming.zhou, sj,
kernel-team, linux-kernel, gourry, willy, ying.huang,
jonathan.cameron, linux-cxl, minchan, senozhatsky
On Mon, Mar 31, 2025 at 3:18 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Nhat Pham wrote:
> > Currently, zsmalloc does not specify any memory policy when it allocates
> > memory for the compressed objects.
> >
> > Let users select the NUMA node for the memory allocation, through the
> > zpool-based API. Direct callers (i.e., zram) should not observe any
> > behavioral change.
> >
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> > include/linux/zpool.h | 4 ++--
> > mm/zpool.c | 8 +++++---
> > mm/zsmalloc.c | 28 +++++++++++++++++++++-------
> > mm/zswap.c | 2 +-
> > 4 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> > index 52f30e526607..0df8722e13d7 100644
> > --- a/include/linux/zpool.h
> > +++ b/include/linux/zpool.h
> > @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
> > void zpool_destroy_pool(struct zpool *pool);
> >
> > int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> > - unsigned long *handle);
> > + unsigned long *handle, int *nid);
>
> I agree with Johannes about the policy knob, so I'll just comment on the
> implementation.
>
> Why not just pass a "const int" for @nid, and use "NUMA_NO_NODE" for the
> "default" case. alloc_pages_node_noprof() is already prepared for a
> NUMA_NO_NODE argument.
Gregory and Johannes gave me that suggestion too! However, my
understanding was that alloc_pages_node(NUMA_NO_NODE, ...) !=
alloc_page(...), and I was trying to preserve the latter since it is
the "original behavior" (primarily for !same_node_mode, but also for
zram, which I tried not to change in this patch).
Please correct me if I'm mistaken, but IIUC:
1. alloc_pages_node(NUMA_NO_NODE, ...) would go to the local/closest node:

	/*
	 * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
	 * prefer the current CPU's closest node. Otherwise node must be valid and
	 * online.
	 */
	static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
							   unsigned int order)
	{
		if (nid == NUMA_NO_NODE)
			nid = numa_mem_id();
2. whereas alloc_page(...) (i.e., the "original" behavior) would
actually adopt the allocation policy of the task entering reclaim, by
calling:

	struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
	{
		struct mempolicy *pol = &default_policy;

		/*
		 * No reference counting needed for current->mempolicy
		 * nor system default_policy
		 */
		if (!in_interrupt() && !(gfp & __GFP_THISNODE))
			pol = get_task_policy(current);
Now, I think the "original" behavior is dumb/broken, and should be
changed altogether. We should probably always pass the page's node id.
On the zswap side, in the next version I'll remove the same_node_mode
sysfs knob and always pass the page's node id to zsmalloc and the page
allocator. That will clean up the zpool path per your (and Johannes'
and Gregory's) suggestion.
That still leaves zram though. zram is more complicated than zswap -
it has multiple allocation paths, so I don't want to touch it quite
yet (and preferably a zram maintainer/developer should do it). :) Or
if zram maintainers are happy with NUMA_NO_NODE, then we can
completely get rid of the pointer arguments etc.
* Re: [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects
From: Dan Williams @ 2025-03-31 23:22 UTC
To: Nhat Pham, Dan Williams
Cc: linux-mm, akpm, hannes, yosry.ahmed, chengming.zhou, sj,
kernel-team, linux-kernel, gourry, willy, ying.huang,
jonathan.cameron, linux-cxl, minchan, senozhatsky
Nhat Pham wrote:
[..]
> That still leaves zram though. zram is more complicated than zswap -
> it has multiple allocation paths, so I don't want to touch it quite
> yet (and preferably a zram maintainer/developer should do it). :) Or
> if zram maintainers are happy with NUMA_NO_NODE, then we can
> completely get rid of the pointer arguments etc.
At a minimum, make the argument a "const int *" so it does not look like
the value can be changed by the leaf functions.
...but I would challenge zram folks to ack/nak that change rather than
maintain old behavior based purely on momentum. I.e. add to the commit
message an "exclude zram from this policy for $explicit_reason"
statement rather than the current $implicit_guess that the old behavior
is there for "reasons".
* Re: [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects
From: Nhat Pham @ 2025-04-01 1:13 UTC
To: Dan Williams
Cc: linux-mm, akpm, hannes, yosry.ahmed, chengming.zhou, sj,
kernel-team, linux-kernel, gourry, willy, ying.huang,
jonathan.cameron, linux-cxl, minchan, senozhatsky
On Mon, Mar 31, 2025 at 4:23 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Nhat Pham wrote:
> [..]
> > That still leaves zram though. zram is more complicated than zswap -
> > it has multiple allocation paths, so I don't want to touch it quite
> > yet (and preferably a zram maintainer/developer should do it). :) Or
> > if zram maintainers are happy with NUMA_NO_NODE, then we can
> > completely get rid of the pointer arguments etc.
>
> At a minimum, make the argument a "const int *" so it does not look like
> the value can be changed by the leaf functions.
That's a good idea! I'll incorporate it into the next version.
>
> ...but I would challenge zram folks to ack/nak that change rather than
> maintain old behavior based purely on momentum. I.e. add to the commit
> message an "exclude zram from this policy for $explicit_reason"
> statement rather than the current $implicit_guess that the old behavior
> is there for "reasons".
Yeah, I'll take a look at the zram code. There's no conceptual reason
why zram cannot or should not adopt the same behavior, from my POV -
it's just a bit more complicated than zswap, and as such would require
more hacking. So I'm leaning towards making the minimal change
required to support zswap first, and letting the zram
maintainers/developers work it out on the zram side :) We'll see!