From: Johannes Weiner <hannes@cmpxchg.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
yosry.ahmed@linux.dev, chengming.zhou@linux.dev, sj@kernel.org,
kernel-team@meta.com, linux-kernel@vger.kernel.org,
gourry@gourry.net, ying.huang@linux.alibaba.com,
jonathan.cameron@huawei.com, dan.j.williams@intel.com,
linux-cxl@vger.kernel.org, minchan@kernel.org,
senozhatsky@chromium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
Date: Wed, 2 Apr 2025 16:24:16 -0400 [thread overview]
Message-ID: <20250402202416.GE198651@cmpxchg.org> (raw)
In-Reply-To: <CAKEwX=Pjk=7Ec3rE2c1SOUL9zeYGcyEOCVQqSffC6Qw077dBHQ@mail.gmail.com>
On Wed, Apr 02, 2025 at 01:09:29PM -0700, Nhat Pham wrote:
> On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > > any policy for the allocation of memory for the compressed data,
> > > instead just adopting the memory policy of the task entering reclaim,
> > > or the default policy (prefer local node) if no such policy is
> > > specified. This can lead to several pathological behaviors in
> > > multi-node NUMA systems:
> > >
> > > 1. Systems with CXL-based memory tiering can encounter the following
> > > inversion with zswap: the coldest pages demoted to the CXL tier
> > > can return to the high tier when they are zswapped out, creating
> > > memory pressure on the high tier.
> > >
> > > 2. Consider a direct reclaimer scanning nodes in order of allocation
> > > preference. If it ventures into remote nodes, the memory it
> > > compresses there should stay there. Trying to shift those contents
> > > over to the reclaiming thread's preferred node further *increases*
> > > its local pressure, provoking more spills. The remote node is
> > > also the most likely to refault this data again. This undesirable
> > > behavior was pointed out by Johannes Weiner in [1].
> > >
> > > 3. For zswap writeback, the zswap entries are organized in
> > > node-specific LRUs, based on the node placement of the original
> > > pages, allowing for targeted zswap writeback for specific nodes.
> > >
> > > However, the compressed data of a zswap entry can be placed on a
> > > different node from the LRU it is placed on. This means that reclaim
> > > targeted at one node might not free up memory used for zswap entries
> > > in that node, but instead reclaim memory on a different node.
> > >
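To make the mismatch in point 3 concrete, here is a minimal sketch (assuming the NUMA-aware list_lru_add() signature; the entry layout and function names are simplified stand-ins, not the actual zswap code) of how the LRU placement and the zsmalloc allocation are decided independently today:

/*
 * Illustrative sketch only, with a trimmed-down entry layout -- not the
 * real zswap structures.
 */
#include <linux/list_lru.h>
#include <linux/mm.h>
#include <linux/zsmalloc.h>

struct zswap_entry_sketch {		/* hypothetical, simplified entry */
	struct list_head lru;
	unsigned long handle;		/* zsmalloc handle of compressed data */
};

static void zswap_store_sketch(struct zs_pool *pool, struct list_lru *lru,
			       struct page *page,
			       struct zswap_entry_sketch *entry, size_t clen)
{
	/* The LRU bucket follows the original page's node ... */
	list_lru_add(lru, &entry->lru, page_to_nid(page), NULL);

	/*
	 * ... but the compressed data is allocated under the reclaiming
	 * task's memory policy (usually "prefer local"), which may be a
	 * completely different node.
	 */
	entry->handle = zs_malloc(pool, clen, GFP_NOWAIT | __GFP_NOWARN);
}
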
> > > All of these issues will be resolved if the compressed data go to the
> > > same node as the original page. This patch encourages this behavior by
> > > having zswap pass the node of the original page to zsmalloc, and having
> > > zsmalloc prefer the specified node if we need to allocate new (zs)pages
> > > for the compressed data.
> > >
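A minimal sketch of that flow follows. The zs_malloc_node() name is a hypothetical stand-in for however the node preference ends up being passed into zsmalloc; alloc_pages_node() is the standard way to express a node preference that can still fall back:

/*
 * Sketch only.  zs_malloc_node() is not an existing API; it is shorthand
 * for the node-aware allocation entry point this patch proposes.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/zsmalloc.h>

unsigned long zs_malloc_node(struct zs_pool *pool, size_t size, gfp_t gfp,
			     int nid);	/* hypothetical API */

/* zswap side: derive the preferred node from the page being compressed. */
static unsigned long zswap_alloc_handle(struct zs_pool *pool,
					struct page *page, size_t clen)
{
	return zs_malloc_node(pool, clen, GFP_NOWAIT | __GFP_NOWARN,
			      page_to_nid(page));
}

/* zsmalloc side: prefer that node when a new zspage page must be allocated. */
static struct page *alloc_zspage_page(gfp_t gfp, int nid)
{
	/*
	 * alloc_pages_node() starts at @nid but may still fall back
	 * through the zonelist -- "preferred", not "bound".
	 */
	return alloc_pages_node(nid, gfp, 0);
}
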
> > > Note that we are not strictly binding the allocation to the preferred
> > > node. We still allow the allocation to fall back to other nodes when
> > > the preferred node is full, or if we have zspages with slots available
> > > on a different node. This is OK, and still a strict improvement over
> > > the status quo:
> > >
> > > 1. On a system with demotion enabled, we will generally prefer
> > > demotions over zswapping, and only zswap when pages have
> > > already gone to the lowest tier. This patch should achieve the
> > > desired effect for the most part.
> > >
> > > 2. If the preferred node is out of memory, letting the compressed data
> > > go to other nodes can be better than the alternative (OOMs,
> > > keeping cold memory unreclaimed, disk swapping, etc.).
> > >
> > > 3. If the allocation goes to a separate node because we have a zspage
> > > with slots available, at least we're not creating extra immediate
> > > memory pressure (since the space is already allocated).
> > >
> > > 4. While there can be some mixing, we generally reclaim pages in
> > > same-node batches, which encourages zspage grouping that is more
> > > likely to go to the right node.
> > >
> > > 5. A strict binding would require partitioning zsmalloc by node, which
> > > is more complicated, and more prone to regression, since it reduces
> > > the storage density of zsmalloc. We need to evaluate the tradeoff
> > > and benchmark carefully before adopting such an involved solution.
> > >
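For reference, the difference between the "prefer but fall back" behavior relied on above and a strict binding is just an allocation flag; a strict variant would look roughly like the second helper below (illustrative only, the patch does not do this):

#include <linux/gfp.h>
#include <linux/mm.h>

/* What the patch relies on: prefer @nid, fall back if it is full. */
static struct page *alloc_preferred_node(int nid, gfp_t gfp)
{
	return alloc_pages_node(nid, gfp, 0);
}

/*
 * A strict binding: __GFP_THISNODE forbids fallback, so the allocation
 * fails outright when @nid is out of memory.  Making zsmalloc honor this
 * would effectively require per-node pools, with the density trade-off
 * mentioned in point 5 above.
 */
static struct page *alloc_this_node_only(int nid, gfp_t gfp)
{
	return alloc_pages_node(nid, gfp | __GFP_THISNODE, 0);
}
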
> > > This patch does not fix zram, leaving its memory allocation behavior
> > > unchanged. We leave this effort to future work.
> >
> > zram's zs_malloc() calls all have page context. It seems a lot easier
> > to just fix the bug for them as well than to have two allocation APIs
> > and verbose commentary?
>
> I think the recompress path doesn't quite have the context at the callsite:
>
> static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> u64 *num_recomp_pages, u32 threshold, u32 prio,
> u32 prio_max)
>
> Note that the "page" argument here is allocated by zram internally,
> and not the original page. We can get the original page's node by
> asking zsmalloc to return it when it returns the compressed data, but
> that's quite involved, and potentially requires further zsmalloc API
> changes.
Yeah, that path currently allocates the target page on the node of
whoever is writing to the "recompress" file.
I think it's fine to use page_to_nid() on that one. It's no worse than
the current behavior. Add an /* XXX */ comment to recompress_store(), and
should somebody care to make that path generally NUMA-aware, they can do
so without having to garbage-collect dependencies in the zsmalloc code.
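For the record, a rough sketch of what that could look like, using the hypothetical zs_malloc_node() from above (this is not an actual zram diff; the helper name and parameters are illustrative):

#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/zsmalloc.h>

unsigned long zs_malloc_node(struct zs_pool *pool, size_t size, gfp_t gfp,
			     int nid);	/* hypothetical API, as above */

static int recompress_alloc_sketch(struct zs_pool *pool, struct page *page,
				   size_t comp_len_new, unsigned long *handle)
{
	/*
	 * XXX: @page is zram's internal buffer, allocated in
	 * recompress_store() on the node of whoever wrote to the sysfs
	 * file.  Using its node is no worse than today's behavior; a
	 * fully NUMA-aware recompress path would have to track the
	 * original page's node instead.
	 */
	*handle = zs_malloc_node(pool, comp_len_new,
				 GFP_NOIO | __GFP_NOWARN, page_to_nid(page));
	if (IS_ERR_VALUE(*handle))
		return PTR_ERR((void *)*handle);

	return 0;
}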