From: Barry Song <21cnbao@gmail.com>
To: Usama Arif <usamaarif642@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>,
ying.huang@intel.com, linux-mm@kvack.org,
akpm@linux-foundation.org, axboe@kernel.dk,
bala.seshasayee@linux.intel.com, chrisl@kernel.org,
david@redhat.com, hannes@cmpxchg.org,
kanchana.p.sridhar@intel.com, kasong@tencent.com,
linux-block@vger.kernel.org, minchan@kernel.org,
senozhatsky@chromium.org, surenb@google.com, terrelln@fb.com,
v-songbaohua@oppo.com, wajdi.k.feghali@intel.com,
willy@infradead.org, yosryahmed@google.com, yuzhao@google.com,
zhengtangquan@oppo.com, zhouchengming@bytedance.com,
ryan.roberts@arm.com
Subject: Re: [PATCH RFC v2 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages
Date: Tue, 19 Nov 2024 09:51:40 +1300
Message-ID: <CAGsJ_4w7bpQ+20jEQ2stmoS_Y+MWcP40i6CrgdzuS=r6uq8VyA@mail.gmail.com>
In-Reply-To: <92b25b7b-63e8-4eb1-b2a6-9c221de2b7e4@gmail.com>
On Tue, Nov 19, 2024 at 9:29 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 18/11/2024 02:27, Barry Song wrote:
> > On Tue, Nov 12, 2024 at 10:37 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Nov 12, 2024 at 8:30 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >>>
> >>> On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> When large folios are compressed at a larger granularity, we observe
> >>>> a notable reduction in CPU usage and a significant improvement in
> >>>> compression ratios.
> >>>>
> >>>> mTHP's ability to be swapped out without splitting and swapped back in
> >>>> as a whole allows compression and decompression at larger granularities.
> >>>>
> >>>> This patchset enhances zsmalloc and zram by adding support for dividing
> >>>> large folios into multi-page blocks, typically configured with a
> >>>> 2-order granularity. Without this patchset, a large folio is always
> >>>> divided into `nr_pages` 4KiB blocks.
> >>>>
> >>>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
> >>>> setting, where the default of 2 allows all anonymous THP to benefit.
> >>>>
> >>>> Examples include:
> >>>> * A 16KiB large folio will be compressed and stored as a single 16KiB
> >>>> block.
> >>>> * A 64KiB large folio will be compressed and stored as four 16KiB
> >>>> blocks.
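
(To spell out the arithmetic behind the examples above: with 4KiB base pages,
the block size is simply PAGE_SIZE shifted left by the order. The snippet below
is illustrative only; the macro and helper names are not taken from the
patchset.)

/* Illustration only; names are not from the patchset. */
#define MULTI_PAGES_ORDER	2				/* the default order */
#define MULTI_PAGES_SIZE	(PAGE_SIZE << MULTI_PAGES_ORDER)	/* 16KiB */

static inline unsigned int nr_multi_page_blocks(struct folio *folio)
{
	/* e.g. a 64KiB folio has 16 pages: 16 >> 2 = 4 blocks of 16KiB each */
	return folio_nr_pages(folio) >> MULTI_PAGES_ORDER;
}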
> >>>>
> >>>> For example, swapping out and swapping in 100MiB of typical anonymous
> >>>> data 100 times (with 16KB mTHP enabled) using zstd yields the following
> >>>> results:
> >>>>
> >>>>                      w/o patches    w/ patches
> >>>> swap-out time(ms)    68711          49908
> >>>> swap-in time(ms)     30687          20685
> >>>> compression ratio    20.49%         16.9%
> >>>
> >>> The data looks very promising :) My understanding is it also results
> >>> in memory saving as well right? Since zstd operates better on bigger
> >>> inputs.
> >>>
> >>> Is there any end-to-end benchmarking? My intuition is that this patch
> >>> series overall will improve the situations, assuming we don't fallback
> >>> to individual zero order page swapin too often, but it'd be nice if
> >>> there is some data backing this intuition (especially with the
> >>> upstream setup, i.e without any private patches). If the fallback
> >>> scenario happens frequently, the patch series can make a page fault
> >>> more expensive (since we have to decompress the entire chunk, and
> >>> discard everything but the single page being loaded in), so it might
> >>> make a difference.
> >>>
> >>> Not super qualified to comment on zram changes otherwise - just a
> >>> casual observer to see if we can adopt this for zswap. zswap has the
> >>> added complexity of not supporting THP zswap in (until Usama's patch
> >>> series lands), and the presence of mixed backing states (due to zswap
> >>> writeback), increasing the likelihood of fallback :)
> >>
> >> Correct. As I mentioned to Usama[1], this could be a problem, and we are
> >> collecting data. The simplest approach to work around the issue is to fall
> >> back to four small folios instead of just one, which would prevent the need
> >> for three extra decompressions.
> >>
> >> [1] https://lore.kernel.org/linux-mm/CAGsJ_4yuZLOE0_yMOZj=KkRTyTotHw4g5g-t91W=MvS5zA4rYw@mail.gmail.com/
> >>
> >
> > Hi Nhat, Usama, Ying,
> >
> > I committed to providing data for cases where large folio allocation fails and
> > swap-in falls back to swapping in small folios. Here is the data that Tangquan
> > helped collect:
> >
> > * zstd, 100MB typical anon memory swapout+swapin 100 times
> >
> > 1. 16kb mTHP swapout + 16kb mTHP swapin + w/o zsmalloc large block
> > (de)compression
> > swap-out(ms) 63151
> > swap-in(ms) 31551
> > 2. 16kb mTHP swapout + 16kb mTHP swapin + w/ zsmalloc large block
> > (de)compression
> > swap-out(ms) 43925
> > swap-in(ms) 21763
> > 3. 16kb mTHP swapout + 100% fallback to small folios swap-in + w/
> > zsmalloc large block (de)compression
> > swap-out(ms) 43423
> > swap-in(ms) 68660
> >
>
> Hi Barry,
>
> Thanks for the numbers!
>
> In what condition was it falling back to small folios? Did you just add a hack
> in alloc_swap_folio to jump to fallback, or was it due to cgroup-limited memory
> pressure?
In real scenarios, even without memcg, fallbacks mainly occur due to memory
fragmentation, which prevents the allocation of mTHP (contiguous pages) from
the buddy system. While cgroup memory pressure isn't the primary issue here,
it can also contribute to fallbacks.
Note that this fallback occurs universally for both do_anonymous_page() and
filesystem mTHP.
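
To be concrete about what "fallback" means here, the allocation path is roughly
the sketch below. This is heavily simplified and the function name is made up;
the real alloc_swap_folio() handles multiple orders and extra PTE checks. The
point is only that the mTHP order is tried first and we drop to an order-0
folio when the buddy system cannot provide contiguous pages:

/*
 * Rough, simplified sketch of the fallback being discussed; not the
 * actual alloc_swap_folio() code.
 */
static struct folio *alloc_swapin_folio_sketch(struct vm_fault *vmf,
					       unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;

	/* try the mTHP order first, e.g. order 2 for 16KiB */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma, vmf->address);
	if (folio)
		return folio;

	/* fragmentation: no contiguous pages, fall back to a 4KiB folio */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address);
}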
>
> Would it be good to test with something like kernel build test (or something else that
> causes swap thrashing) to see if the regression worsens with large granularity decompression?
> i.e. would be good to have numbers for real world applications.
I’m confident that the data will be reliable as long as memory isn’t
fragmented, but fragmentation depends on when the test is run. For example, on
a fresh system memory is not fragmented at all, but after running various
workloads for a few hours, serious fragmentation may occur.
I recall reporting that a phone using 64KB mTHP had a high mTHP allocation
success rate in the first hour, but this dropped to less than 10% after a few
hours of use.
In my understanding, the performance of mTHP can vary significantly depending
on the system's fragmentation state. This is why efforts like Yu Zhao's TAO are
being developed to address the mTHP allocation success rate issue.
>
> > Thus, "swap-in(ms) 68660," where mTHP allocation always fails, is significantly
> > slower than "swap-in(ms) 21763," where mTHP allocation succeeds.
> >
> > If there are no objections, I could send a v3 patch to fall back to 4
> > small folios
> > instead of one. However, this would significantly increase the complexity of
> > do_swap_page(). My gut feeling is that the added complexity might not be
> > well-received :-)
> >
>
> If there is space for 4 small folios, then maybe it might be worth passing
> __GFP_DIRECT_RECLAIM? as that can trigger compaction and give a large folio.
>
Small folios are always much *easier* to obtain from the system. Triggering
compaction won't necessarily yield a large folio if unmovable small folios are
scattered around. And for small folios, memcg reclaim can already kick in,
since a small folio is charged with GFP_KERNEL just as before:
static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	swp_entry_t entry;

	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address);
	if (!folio)
		return NULL;

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
					   GFP_KERNEL, entry)) {
		folio_put(folio);
		return NULL;
	}

	return folio;
}
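
For completeness, the "fall back to 4 small folios" idea I mentioned would look
roughly like the sketch below on the allocation/charging side. This is only an
illustration: the helper name is made up, the fault address is assumed to be
aligned to the 16KiB block, and the swap entry offsets and error handling are
simplified. The point is just that all the small folios are allocated and
charged up front, so the 16KiB block needs to be decompressed only once:

/*
 * Hypothetical sketch only: if the 16KiB mTHP allocation fails, allocate
 * all "nr" order-0 folios up front so the multi-page block is decompressed
 * once to fill them, instead of once per 4KiB fault.
 */
static int swapin_fallback_small_folios(struct vm_fault *vmf,
					struct folio **folios, int nr)
{
	struct vm_area_struct *vma = vmf->vma;
	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
	int i;

	for (i = 0; i < nr; i++) {
		folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
					    vmf->address + i * PAGE_SIZE);
		if (!folios[i])
			goto err;
		/* charged with GFP_KERNEL, as __alloc_swap_folio() does today */
		if (mem_cgroup_swapin_charge_folio(folios[i], vma->vm_mm,
						   GFP_KERNEL, entry)) {
			folio_put(folios[i]);
			goto err;
		}
	}
	return 0;
err:
	while (--i >= 0)
		folio_put(folios[i]);
	return -ENOMEM;
}

The do_swap_page() complexity I'm worried about is everything after this point:
mapping four folios, handling races on each PTE, and so on.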
Thanks
Barry