From: Johannes Weiner <hannes@cmpxchg.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	yosry.ahmed@linux.dev, chengming.zhou@linux.dev, sj@kernel.org,
	kernel-team@meta.com, linux-kernel@vger.kernel.org,
	gourry@gourry.net, ying.huang@linux.alibaba.com,
	jonathan.cameron@huawei.com, dan.j.williams@intel.com,
	linux-cxl@vger.kernel.org, minchan@kernel.org,
	senozhatsky@chromium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
Date: Wed, 2 Apr 2025 16:24:16 -0400
Message-ID: <20250402202416.GE198651@cmpxchg.org>
In-Reply-To: <CAKEwX=Pjk=7Ec3rE2c1SOUL9zeYGcyEOCVQqSffC6Qw077dBHQ@mail.gmail.com>

On Wed, Apr 02, 2025 at 01:09:29PM -0700, Nhat Pham wrote:
> On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > > any policy for the allocation of memory for the compressed data,
> > > instead just adopting the memory policy of the task entering reclaim,
> > > or the default policy (prefer local node) if no such policy is
> > > specified. This can lead to several pathological behaviors in
> > > multi-node NUMA systems:
> > >
> > > 1. Systems with CXL-based memory tiering can encounter the following
> > >    inversion with zswap: the coldest pages demoted to the CXL tier
> > >    can return to the high tier when they are zswapped out, creating
> > >    memory pressure on the high tier.
> > >
> > > 2. Consider a direct reclaimer scanning nodes in order of allocation
> > >    preference. If it ventures into remote nodes, the memory it
> > >    compresses there should stay there. Trying to shift those contents
> > >    over to the reclaiming thread's preferred node further *increases*
> > >    its local pressure, provoking more spills. The remote node is also
> > >    the most likely node to refault this data. This undesirable
> > >    behavior was pointed out by Johannes Weiner in [1].
> > >
> > > 3. For zswap writeback, the zswap entries are organized in
> > >    node-specific LRUs, based on the node placement of the original
> > >    pages, allowing for targeted zswap writeback for specific nodes.
> > >
> > >    However, the compressed data of a zswap entry can be placed on a
> > >    different node from the LRU it is on. This means that reclaim
> > >    targeted at one node might not free up the memory used for zswap
> > >    entries in that node, but instead reclaim memory on a different node.
> > >
> > > All of these issues are resolved if the compressed data goes to the
> > > same node as the original page. This patch encourages this behavior
> > > by having zswap pass the node of the original page to zsmalloc, and
> > > having zsmalloc prefer the specified node when new (zs)pages need to
> > > be allocated for the compressed data.
> > >
> > > Note that we are not strictly binding the allocation to the preferred
> > > node. We still allow the allocation to fall back to other nodes when
> > > the preferred node is full, or if we have zspages with slots available
> > > on a different node. This is OK, and still a strict improvement over
> > > the status quo:
> > >
> > > 1. On a system with demotion enabled, we will generally prefer
> > >    demotions over zswapping, and only zswap when pages have
> > >    already gone to the lowest tier. This patch should achieve the
> > >    desired effect for the most part.
> > >
> > > 2. If the preferred node is out of memory, letting the compressed data
> > >    go to other nodes can be better than the alternatives (OOMs,
> > >    keeping cold memory unreclaimed, disk swapping, etc.).
> > >
> > > 3. If the allocation goes to a separate node because we have a zspage
> > >    with slots available, at least we're not creating extra immediate
> > >    memory pressure (since the space is already allocated).
> > >
> > > 4. While there can be mixing, we generally reclaim pages in
> > >    same-node batches, which encourages zspage grouping that is more
> > >    likely to go to the right node.
> > >
> > > 5. A strict binding would require partitioning zsmalloc by node, which
> > >    is more complicated and more prone to regression, since it reduces
> > >    the storage density of zsmalloc. We need to evaluate the tradeoff
> > >    and benchmark carefully before adopting such an involved solution.
> > >
> > > This patch does not fix zram, leaving its memory allocation behavior
> > > unchanged. We leave this effort to future work.
> >
> > zram's zs_malloc() calls all have page context. It seems a lot easier
> > to just fix the bug for them as well than to have two allocation APIs
> > and verbose commentary?
> 
> I think the recompress path doesn't quite have the context at the callsite:
> 
> static int recompress_slot(struct zram *zram, u32 index, struct page *page,
>    u64 *num_recomp_pages, u32 threshold, u32 prio,
>    u32 prio_max)
> 
> Note that the "page" argument here is allocated by zram internally,
> and not the original page. We can get the original page's node by
> asking zsmalloc to return it when it returns the compressed data, but
> that's quite involved, and potentially requires further zsmalloc API
> changes.

Yeah, that path currently allocates the target page on the node of
whoever is writing to the "recompress" file.

I think it's fine to use page_to_nid() on that one. It's no worse than
the current behavior. Add an /* XXX */ to recompress_store(), and should
somebody care to make that path generally NUMA-aware, they can do so
without having to garbage-collect dependencies in zsmalloc code.
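
Something like this is what I have in mind - just a rough sketch,
assuming zs_malloc() grows the same nid hint your zswap path passes in,
with the surrounding names and gfp flags approximate rather than taken
from an actual patch:

	/*
	 * Reuse the node of the buffer page zram allocated for us.
	 * XXX: that's the node of whoever wrote to "recompress", not
	 * necessarily the node the original data lived on.
	 */
	int nid = page_to_nid(page);

	handle_new = zs_malloc(zram->mem_pool, comp_len_new,
			       GFP_NOIO | __GFP_NOWARN, nid);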

