From: Nhat Pham <nphamcs@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Usama Arif <usamaarif642@gmail.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org,  hughd@google.com,
	yosry.ahmed@linux.dev, mhocko@kernel.org,
	 roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev,  len.brown@intel.com,
	chengming.zhou@linux.dev, kasong@tencent.com,  chrisl@kernel.org,
	huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
	 viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
	 lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
	pavel@kernel.org,  kernel-team@meta.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	 linux-pm@vger.kernel.org
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space
Date: Tue, 8 Apr 2025 09:25:57 -0700	[thread overview]
Message-ID: <CAKEwX=M3do_7SJGKwfZQ8vOSQN4aM1ZU04Q3E99CW=UTCkUMOQ@mail.gmail.com> (raw)
In-Reply-To: <20250408154547.GC816@cmpxchg.org>

On Tue, Apr 8, 2025 at 8:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 08, 2025 at 02:04:06PM +0100, Usama Arif wrote:
> >
> >
> > On 08/04/2025 00:42, Nhat Pham wrote:
> > >
> > > V. Benchmarking
> > >
> > > As a proof of concept, I run the prototype through some simple
> > > benchmarks:
> > >
> > > 1. usemem: 16 threads, 2G each, memory.max = 16G
> > >
> > > I benchmarked the following usemem commands:
> > >
> > > time usemem --init-time -w -O -s 10 -n 16 2g
> > >
> > > Baseline:
> > > real: 33.96s
> > > user: 25.31s
> > > sys: 341.09s
> > > average throughput: 111295.45 KB/s
> > > average free time: 2079258.68 usecs
> > >
> > > New Design:
> > > real: 35.87s
> > > user: 25.15s
> > > sys: 373.01s
> > > average throughput: 106965.46 KB/s
> > > average free time: 3192465.62 usecs
> > >
> > > To root cause this regression, I ran perf on the usemem program, as
> > > well as on the following stress-ng program:
> > >
> > > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> > >
> > > and observed the (predicted) increase in lock contention on swap cache
> > > accesses. This regression is alleviated if I put together the
> > > following hack: limit the virtual swap space to a sufficient size for
> > > the benchmark, range partition the swap-related data structures (swap
> > > cache, zswap tree, etc.) based on the limit, and distribute the
> > > allocation of virtual swap slots among these partitions (on a per-CPU
> > > basis):
> > >
> > > real: 34.94s
> > > user: 25.28s
> > > sys: 360.25s
> > > average throughput: 108181.15 KB/s
> > > average free time: 2680890.24 usecs
> > >
> > > As mentioned above, I will implement proper dynamic swap range
> > > partitioning in a follow up work.
> > >
> > > 2. Kernel building: zswap enabled, 52 workers (one per processor),
> > > memory.max = 3G.
> > >
> > > Baseline:
> > > real: 183.55s
> > > user: 5119.01s
> > > sys: 655.16s
> > >
> > > New Design:
> > > real: mean: 184.5s
> > > user: mean: 5117.4s
> > > sys: mean: 695.23s
> > >
> > > New Design (Static Partition)
> > > real: 183.95s
> > > user: 5119.29s
> > > sys: 664.24s
> > >
> >
> > Hi Nhat,
> >
> > Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.
> >
> > Just wanted to check if you had a look at the memory regression during these benchmarks?
> >
> > Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?
> >
> > For a 64G swap that is filled with private anon pages, the overhead in MB might be (sizeof(swp_desc) in bytes * 16M) - 16M (zerobitmap) - 16M*8 (swap map)?
> >
> > This looks like a sizeable memory regression?
>
> One thing to keep in mind is that the swap descriptor is currently
> blatantly explicit, and many conversions and optimizations have not
> been done yet. There are some tradeoffs made here regarding code
> reviewability, but I agree it makes it hard to see what this would
> look like fully realized.
>
> I think what's really missing is an analysis of what the goal is and
> what the overhead will be then.
>
> The swapin path currently consults the swapcache, then the zeromap,
> then zswap, and finally the backend. The external swap_cgroup array is
> consulted to determine who to charge for the new page.
>
> With vswap, the descriptor is looked up and resolves to a type,
> location, cgroup ownership, a refcount. This means it replaces the
> swapcache, the zeromap, the cgroup map, and largely the swap_map.
>
> Nhat was not quite sure yet if the swap_map can be a single bit per
> entry or two bits to represent bad slots. In any case, it's a large
> reduction in static swap space overhead, and eliminates the tricky
> swap count continuation code.
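
As a rough illustration of the kind of descriptor described above (the
field names and layout here are assumptions made for this sketch, not
the actual structure from the series), it might look something like:

/*
 * Illustrative sketch only -- not the actual layout from this series.
 * One descriptor per swapped-out page: it records where the page
 * currently lives, who to charge on swapin, and how many references
 * (PTEs, swap cache, etc.) still point at it.
 */
enum swp_desc_type {
	SWP_DESC_FOLIO,		/* still in memory, via the swap cache */
	SWP_DESC_ZERO,		/* page of zeroes, no backing needed */
	SWP_DESC_ZSWAP,		/* compressed copy held by zswap */
	SWP_DESC_PHYS,		/* slot in a physical swapfile */
};

struct swp_desc {
	enum swp_desc_type	type;
	union {
		struct folio		*folio;		/* SWP_DESC_FOLIO */
		struct zswap_entry	*zswap;		/* SWP_DESC_ZSWAP */
		unsigned long		phys_slot;	/* SWP_DESC_PHYS */
	};
	struct mem_cgroup	*memcg;	/* replaces the swap_cgroup array */
	atomic_t		refcnt;	/* replaces swap_map counts/continuations */
};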

You're right. I haven't touched the swapfile swap map and the zeromap
bitmap at all, primarily because it's a non-functional change
(optimization only). It also adds more ifdefs to the final codebase :)

In the next version, I can tag on one patch to:

1. Remove the zeromap bitmap. This one is pretty straightforward -
we're not using it at all.

2. Swap map reduction. I'm about 70% sure we don't need the
SWAP_MAP_BAD state. With the vswap reverse map and the swapfile in-use
counters, we should be able to convert the swap map into a pure bitmap
(rough sketch below). If we can't, then it's 2 bits per physical swap
slot.
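
To make (2) a bit more concrete, here is a minimal sketch of what a
pure-bitmap swap map could look like, assuming the per-entry reference
count lives entirely in the virtual swap descriptor; the struct and
function names are made up for illustration and locking is omitted:

#include <linux/atomic.h>
#include <linux/bitmap.h>
#include <linux/bitops.h>

struct phys_swap_info {
	unsigned long	nr_slots;	/* total physical slots in this swapfile */
	unsigned long	*alloc_map;	/* 1 bit per slot: allocated or free */
	atomic_long_t	inuse;		/* in-use slot counter */
};

/* Claim a free physical slot; returns the slot index or -1 if full. */
static long phys_swap_alloc(struct phys_swap_info *si)
{
	unsigned long slot = find_first_zero_bit(si->alloc_map, si->nr_slots);

	if (slot >= si->nr_slots)
		return -1;
	set_bit(slot, si->alloc_map);
	atomic_long_inc(&si->inuse);
	return slot;
}

/* Release a slot once the owning virtual swap descriptor drops it. */
static void phys_swap_free(struct phys_swap_info *si, unsigned long slot)
{
	clear_bit(slot, si->alloc_map);
	atomic_long_dec(&si->inuse);
}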



Thread overview: 35+ messages
2025-04-07 23:42 Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 01/14] swapfile: rearrange functions Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 02/14] mm: swap: add an abstract API for locking out swapoff Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 03/14] mm: swap: add a separate type for physical swap slots Nhat Pham
2025-04-08 14:15   ` Johannes Weiner
2025-04-08 15:11     ` Nhat Pham
2025-04-22 14:41     ` Yosry Ahmed
     [not found]     ` <6807ab09.670a0220.152ca3.502fSMTPIN_ADDED_BROKEN@mx.google.com>
2025-04-22 15:50       ` Nhat Pham
2025-04-22 18:50         ` Kairui Song
2025-04-07 23:42 ` [RFC PATCH 04/14] mm: swap: swap cache support for virtualized swap Nhat Pham
2025-04-08 15:00   ` Johannes Weiner
2025-04-08 15:34     ` Nhat Pham
2025-04-08 15:43       ` Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 05/14] zswap: unify zswap tree " Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 06/14] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 07/14] swap: implement the swap_cgroup API using virtual swap Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 08/14] swap: manage swap entry lifetime at the virtual swap layer Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 09/14] swap: implement locking out swapoff using virtual swap slot Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 10/14] mm: swap: decouple virtual swap slot from backing store Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 11/14] memcg: swap: only charge physical swap slots Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 12/14] vswap: support THP swapin and batch free_swap_and_cache Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 13/14] swap: simplify swapoff using virtual swap Nhat Pham
2025-04-07 23:42 ` [RFC PATCH 14/14] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
2025-04-08 13:04 ` [RFC PATCH 00/14] Virtual Swap Space Usama Arif
2025-04-08 15:20   ` Nhat Pham
2025-04-08 15:45   ` Johannes Weiner
2025-04-08 16:25     ` Nhat Pham [this message]
2025-04-08 16:27       ` Nhat Pham
2025-04-08 16:22 ` Kairui Song
2025-04-08 16:47   ` Nhat Pham
2025-04-08 16:59     ` Kairui Song
2025-04-22 14:43       ` Yosry Ahmed
2025-04-22 14:56 ` Yosry Ahmed
     [not found] ` <6807afd0.a70a0220.2ae8b9.e07cSMTPIN_ADDED_BROKEN@mx.google.com>
2025-04-22 17:15   ` Nhat Pham
2025-04-22 19:29     ` Nhat Pham
