From: Nhat Pham <nphamcs@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
yosry.ahmed@linux.dev, chengming.zhou@linux.dev, sj@kernel.org,
kernel-team@meta.com, linux-kernel@vger.kernel.org,
gourry@gourry.net, ying.huang@linux.alibaba.com,
jonathan.cameron@huawei.com, dan.j.williams@intel.com,
linux-cxl@vger.kernel.org, minchan@kernel.org,
senozhatsky@chromium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
Date: Wed, 2 Apr 2025 13:25:39 -0700
Message-ID: <CAKEwX=PJ-LQO3xffXx5-MpRBXWaCC+QQdM1kCY0=NFX4sNEhgA@mail.gmail.com>
In-Reply-To: <20250402202416.GE198651@cmpxchg.org>

On Wed, Apr 2, 2025 at 1:24 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 02, 2025 at 01:09:29PM -0700, Nhat Pham wrote:
> > On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > > > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > > > any policy for the allocation of memory for the compressed data,
> > > > instead just adopting the memory policy of the task entering reclaim,
> > > > or the default policy (prefer local node) if no such policy is
> > > > specified. This can lead to several pathological behaviors in
> > > > multi-node NUMA systems:
> > > >
> > > > 1. Systems with CXL-based memory tiering can encounter the following
> > > > inversion with zswap: the coldest pages demoted to the CXL tier
> > > > can return to the high tier when they are zswapped out, creating
> > > > memory pressure on the high tier.
> > > >
> > > > 2. Consider a direct reclaimer scanning nodes in order of allocation
> > > > preference. If it ventures into remote nodes, the memory it
> > > > compresses there should stay there. Trying to shift those contents
> > > > over to the reclaiming thread's preferred node further *increases*
> > > > its local pressure, provoking more spills. The remote node is
> > > > also the most likely to refault this data again. This undesirable
> > > > behavior was pointed out by Johannes Weiner in [1].
> > > >
> > > > 3. For zswap writeback, the zswap entries are organized in
> > > > node-specific LRUs, based on the node placement of the original
> > > > pages, allowing for targeted zswap writeback for specific nodes.
> > > >
> > > > However, the compressed data of a zswap entry can be placed on a
> > > > different node from the one whose LRU it sits on. This means that reclaim
> > > > targeted at one node might not free up memory used for zswap entries
> > > > in that node, but instead free memory on a different node.
> > > >
> > > > All of these issues will be resolved if the compressed data go to the
> > > > same node as the original page. This patch encourages this behavior by
> > > > having zswap pass the node of the original page to zsmalloc, and having
> > > > zsmalloc prefer the specified node if we need to allocate new (zs)pages
> > > > for the compressed data.
> > > >
> > > > Note that we are not strictly binding the allocation to the preferred
> > > > node. We still allow the allocation to fall back to other nodes when
> > > > the preferred node is full, or if we have zspages with slots available
> > > > on a different node. This is OK, and still a strict improvement over
> > > > the status quo:
> > > >
> > > > 1. On a system with demotion enabled, we will generally prefer
> > > > demotions over zswapping, and only zswap when pages have
> > > > already gone to the lowest tier. This patch should achieve the
> > > > desired effect for the most part.
> > > >
> > > > 2. If the preferred node is out of memory, letting the compressed data
> > > > go to other nodes can be better than the alternative (OOMs,
> > > > keeping cold memory unreclaimed, disk swapping, etc.).
> > > >
> > > > 3. If the allocation goes to a separate node because we have a zspage
> > > > with slots available, at least we're not creating extra immediate
> > > > memory pressure (since the space is already allocated).
> > > >
> > > > 4. While there can be some mixing, we generally reclaim pages in
> > > > same-node batches, which encourages zspage grouping that is more
> > > > likely to go to the right node.
> > > >
> > > > 5. A strict binding would require partitioning zsmalloc by node, which
> > > > is more complicated, and more prone to regression, since it reduces
> > > > the storage density of zsmalloc. We need to evaluate the tradeoff
> > > > and benchmark carefully before adopting such an involved solution.
> > > >
> > > > This patch does not fix zram, leaving its memory allocation behavior
> > > > unchanged. We leave this effort to future work.
> > >
> > > zram's zs_malloc() calls all have page context. It seems a lot easier
> > > to just fix the bug for them as well than to have two allocation APIs
> > > and verbose commentary?
> >
> > I think the recompress path doesn't quite have the context at the callsite:
> >
> > static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> >                            u64 *num_recomp_pages, u32 threshold, u32 prio,
> >                            u32 prio_max)
> >
> > Note that the "page" argument here is allocated by zram internally,
> > and not the original page. We can get the original page's node by
> > asking zsmalloc to return it when it returns the compressed data, but
> > that's quite involved, and potentially requires further zsmalloc API
> > changes.
>
> Yeah, that path currently allocates the target page on the node of
> whoever is writing to the "recompress" file.
>
> I think it's fine to use page_to_nid() on that one. It's no worse than
> the current behavior. Add an /* XXX */ to recompress_store() and
> should somebody care to make that path generally NUMA-aware they can
> do so without having to garbage-collect dependencies in zsmalloc code.
SGTM. I'll fix that.
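
Roughly, the recompress fix I have in mind is just reusing the node of
zram's own buffer page as the hint. A sketch only -- the nid argument is
the one this patch adds to zs_malloc(), and the variable names and gfp
flags around the call are paraphrased from memory:

	/*
	 * "page" is the buffer zram allocated for recompression, not the
	 * original page, so its node is only a best-effort hint -- no
	 * worse than what the recompress path does today.
	 */
	handle_new = zs_malloc(zram->mem_pool, comp_len_new, GFP_NOIO,
			       page_to_nid(page));

plus an /* XXX */ in recompress_store() noting that the path is not
properly NUMA-aware yet, as you suggested.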
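
For completeness, the "prefer but don't strictly bind" behavior described
in the changelog boils down to something like the following on the
zswap/zsmalloc side (again just a sketch; the zpool plumbing and the exact
signatures are whatever the final version of the patch ends up with):

	/* zswap: hand the original page's node down to the allocator. */
	handle = zs_malloc(pool, dlen, gfp, page_to_nid(page));

	/*
	 * zsmalloc: when a new zspage page is needed, prefer that node.
	 * alloc_pages_node() falls back to other nodes if the preferred
	 * one is full, which gives the non-strict preference above.
	 */
	page = alloc_pages_node(nid, gfp, 0);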