linux-mm.kvack.org archive mirror
* [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
@ 2025-03-29 11:02 Nhat Pham
  2025-03-29 11:02 ` [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects Nhat Pham
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Nhat Pham @ 2025-03-29 11:02 UTC
  To: linux-mm
  Cc: akpm, hannes, yosry.ahmed, chengming.zhou, sj, kernel-team,
	linux-kernel, gourry, willy, ying.huang, jonathan.cameron,
	dan.j.williams, linux-cxl, minchan, senozhatsky

Currently, systems with CXL-based memory tiering can encounter the
following inversion with zswap: the coldest pages demoted to the CXL
tier can return to the high tier when they are zswapped out,
creating memory pressure on the high tier.

This happens because zsmalloc, zswap's backend memory allocator, does
not enforce any memory policy. If the task reclaiming memory follows
the local-first policy, for example, the memory requested for zswap can
be served by the upper tier, leading to the aforementioned inversion.
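
To make the inversion concrete, here is a small userspace toy model. It
is not kernel code: struct sketch_node and sketch_alloc_local_first()
are hypothetical names made up for illustration, and the fallback order
is a simplification. It only shows that a local-first choice depends on
where the reclaiming task runs, not on where the cold page lives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy description of a memory node. */
struct sketch_node {
	int is_top_tier;	/* 1: DRAM-like top tier, 0: CXL-like lower tier */
	long free_pages;	/* free capacity on this node */
};

/*
 * Toy local-first placement: prefer the node local to the reclaiming
 * task, and fall back to the first node with free memory otherwise.
 * Note that the node of the page being compressed is not an input.
 * Returns the chosen node id, or -1 if every node is full.
 */
static int sketch_alloc_local_first(const struct sketch_node *nodes,
				    size_t nr_nodes, int local_nid)
{
	assert(local_nid >= 0 && (size_t)local_nid < nr_nodes);

	if (nodes[local_nid].free_pages > 0)
		return local_nid;

	for (size_t nid = 0; nid < nr_nodes; nid++)
		if (nodes[nid].free_pages > 0)
			return (int)nid;

	return -1;
}
```

In this model, a reclaimer running on a top-tier node that compresses a
page living on the CXL node still gets top-tier memory for the
compressed copy, as long as the top tier has anything free.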

This RFC fixes this inversion by adding a new memory allocation mode
for zswap (exposed through a zswap sysfs knob), intended for
hosts with CXL, where the memory for the compressed object is requested
preferentially from the same node that the original page resides on.
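
As a rough userspace sketch of the decision described above (not the
code in the patches): the function name, the boolean flag, and
SKETCH_NO_NODE are all assumptions for illustration. SKETCH_NO_NODE
stands in for a "no preferred node" value in the spirit of the kernel's
NUMA_NO_NODE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical "no preference" marker; the allocator's default applies. */
#define SKETCH_NO_NODE (-1)

/*
 * Choose the preferred node for the compressed copy of a page.
 *
 * same_node_mode: hypothetical flag mirroring the proposed knob.
 * page_nid:       node on which the original page resides.
 *
 * With the mode on, prefer the page's own node. With it off, express
 * no preference, so the allocator's default policy decides.
 */
static int sketch_zswap_target_node(bool same_node_mode, int page_nid)
{
	assert(page_nid >= 0);

	return same_node_mode ? page_nid : SKETCH_NO_NODE;
}
```

The returned value is only a preference in this sketch. Whether the
real allocation may fall back to another node when the preferred one is
full is a policy question this example does not settle.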

With the new zswap allocation mode enabled, we should observe the
following dynamics:

1. When demotion is turned on, under reasonable conditions, zswap will
   prefer CXL memory by default, since top-tier memory being reclaimed
   will typically be demoted instead of swapped.

2. This should prevent reclaim on the lower tier from causing high-tier
   memory pressure due to new allocations.

3. This should avoid a quiet promotion of cold memory (memory being
   zswapped is cold, but is promoted when put into the zswap pool
   because the memory allocated for the compressed copy comes from the
   high tier).
   
4. However, this may put pressure on the CXL tier, which may in turn
   result in further demotion (to swap, etc). This needs to be
   tested.

I'm still testing and collecting more data, but I figured I should send
this out as an RFC to spark discussion:

1. Is this the right policy? Do we need a more complicated policy?
   Should we instead go for the "lowest" node (which would require a new
   memory tiering API)? Or maybe try each node, from the current node
   down to the lowest node in the hierarchy?

   Also, I hacked this fix together with CXL in mind, but if there are
   other cases that I should also address, we can explore a more general
   memory allocation strategy or interface.

2. Similarly, is this the right zsmalloc API? For instance, we could
   build a full-fledged mempolicy-based API for zsmalloc, but I haven't
   found a use case for it yet.

3. Assuming this is the right policy, what should the semantics be? I'm
   not very good at naming things, so same_node_mode might not be it :)
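
For the alternative raised in question 1 (trying each node from the
current node down to the lowest node in the hierarchy), a userspace
sketch could look like the following. Everything here is hypothetical:
the struct, the next_lower link, and the function name are invented for
illustration and do not correspond to an existing memory tiering API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "no node" marker. */
#define SKETCH_NO_NODE (-1)

/* Hypothetical per-node view of the tier hierarchy. */
struct sketch_tier_node {
	long free_pages;	/* free capacity on this node */
	int next_lower;		/* next node down, or SKETCH_NO_NODE */
};

/*
 * Walk from start_nid toward the bottom of the hierarchy and return
 * the first node with free capacity. The walk never moves upward, so
 * a cold page's compressed copy cannot land on a higher tier than the
 * page itself. The hop counter bounds the walk if the links loop.
 */
static int sketch_pick_node_walk_down(const struct sketch_tier_node *nodes,
				      size_t nr_nodes, int start_nid)
{
	int nid = start_nid;
	size_t hops = 0;

	while (nid != SKETCH_NO_NODE && hops++ < nr_nodes) {
		assert(nid >= 0 && (size_t)nid < nr_nodes);
		if (nodes[nid].free_pages > 0)
			return nid;
		nid = nodes[nid].next_lower;
	}

	return SKETCH_NO_NODE;
}
```

Compared with a strict same-node preference, this variant trades
locality for headroom: when the page's node is full, the copy goes
further down instead of failing or bouncing back up.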

Nhat Pham (2):
  zsmalloc: let callers select NUMA node to store the compressed objects
  zswap: add sysfs knob for same node mode

 Documentation/admin-guide/mm/zswap.rst |  9 +++++++++
 include/linux/zpool.h                  |  4 ++--
 mm/zpool.c                             |  8 +++++---
 mm/zsmalloc.c                          | 28 +++++++++++++++++++-------
 mm/zswap.c                             | 10 +++++++--
 5 files changed, 45 insertions(+), 14 deletions(-)


base-commit: 4135040c342ba080328891f1b7e523c8f2f04c58
-- 
2.47.1



Thread overview: 13+ messages
2025-03-29 11:02 [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems Nhat Pham
2025-03-29 11:02 ` [RFC PATCH 1/2] zsmalloc: let callers select NUMA node to store the compressed objects Nhat Pham
2025-03-31 22:17   ` Dan Williams
2025-03-31 23:03     ` Nhat Pham
2025-03-31 23:22       ` Dan Williams
2025-04-01  1:13         ` Nhat Pham
2025-03-29 11:02 ` [RFC PATCH 2/2] zswap: add sysfs knob for same node mode Nhat Pham
2025-03-29 19:53 ` [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems Yosry Ahmed
2025-03-29 22:13   ` Nhat Pham
2025-03-29 22:17     ` Nhat Pham
2025-03-31 16:53   ` Johannes Weiner
2025-03-31 17:32     ` Nhat Pham
2025-03-31 17:06   ` Gregory Price
