From: Barry Song <21cnbao@gmail.com>
Date: Wed, 23 Oct 2024 23:26:47 +1300
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
To: Usama Arif
Cc: senozhatsky@chromium.org, minchan@kernel.org, hanchuanhua@oppo.com,
 v-songbaohua@oppo.com, akpm@linux-foundation.org, linux-mm@kvack.org,
 hannes@cmpxchg.org, david@redhat.com, willy@infradead.org,
 kanchana.p.sridhar@intel.com, yosryahmed@google.com, nphamcs@gmail.com,
 chengming.zhou@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
 riel@surriel.com, shakeel.butt@linux.dev, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <5313c721-9cf1-4ecd-ac23-1eeddabd691f@gmail.com>

On Wed, Oct 23, 2024 at 11:07 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 23, 2024 at 10:17 AM Usama Arif wrote:
> >
> >
> > On 22/10/2024 21:46, Barry Song wrote:
> > > On Wed, Oct 23, 2024 at 4:26 AM Usama Arif wrote:
> > >>
> > >>
> > >> On 21/10/2024 11:40, Usama Arif wrote:
> > >>>
> > >>>
> > >>> On 21/10/2024 06:09, Barry Song wrote:
> > >>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif wrote:
> > >>>>>
> > >>>>> After large folio zswapout support added in [1], this patch adds
> > >>>>> support for zswapin of large folios to bring it on par with zram.
> > >>>>> This series makes sure that the benefits of large folios (fewer
> > >>>>> page faults, batched PTE and rmap manipulation, reduced lru list,
> > >>>>> TLB coalescing (for arm64 and amd)) are not lost at swap out when
> > >>>>> using zswap.
> > >>>>>
> > >>>>> It builds on top of [2] which added large folio swapin support for
> > >>>>> zram and provides the same level of large folio swapin support as
> > >>>>> zram, i.e. only supporting swap count == 1.
> > >>>>>
> > >>>>> Patch 1 skips swapcache for swapping in zswap pages, this should improve
> > >>>>> no-readahead swapin performance [3], and also allows us to build on large
> > >>>>> folio swapin support added in [2], hence is a prerequisite for patch 3.
> > >>>>>
> > >>>>> Patch 3 adds support for large folio zswapin. This patch does not add
> > >>>>> support for hybrid backends (i.e. folios partly present in swap and zswap).
> > >>>>>
> > >>>>> The main performance benefit comes from maintaining large folios *after*
> > >>>>> swapin, large folio performance improvements have been mentioned in previous
> > >>>>> series posted on it [2],[4], so have not added those. Below is a simple
> > >>>>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along
> > >>>>> with memory integrity check).
> > >>>>>
> > >>>>>                                 | no mTHP (ms) | 1M mTHP enabled (ms)
> > >>>>> Base kernel                     | 1165         | 1163
> > >>>>> Kernel with mTHP zswpin series  | 1203         | 738
> > >>>>
> > >>>> Hi Usama,
> > >>>> Do you know where this minor regression for non-mTHP comes from?
> > >>>> As you even have skipped swapcache for small folios in zswap in patch 1,
> > >>>> that part should have some gain? Is it because of zswap_present_test()?
> > >>>>
> > >>>
> > >>> Hi Barry,
> > >>>
> > >>> The microbenchmark does a sequential read of 1G of memory, so it probably
> > >>> isn't very representative of real-world use cases. This also means that
> > >>> swap_vma_readahead is able to accurately read ahead all pages in its window.
> > >>> With this patch series, if doing 4K swapin, you get 1G/4K calls of fast
> > >>> do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow
> > >>> do_swap_page calls. I had added some prints and was seeing 8 pages being
> > >>> read ahead in 1 do_swap_page. The larger number of calls causes the slight
> > >>> regression (even though they are quite fast). I think in a realistic scenario,
> > >>> where the readahead window won't be as large, there won't be a regression.
> > >>> The cost of zswap_present_test in the whole call stack of swapping in a page
> > >>> is very low and I think can be ignored.
> > >>>
> > >>> I think the more interesting thing is what Kanchana pointed out in
> > >>> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
> > >>> I am curious, did you see this when testing large folio swapin and compression
> > >>> at 4K granularity? It looks like swap thrashing, so I think it would be common
> > >>> between zswap and zram. I don't have larger granularity zswap compression done,
> > >>> which is why I think there is a regression in time taken. (It could be because
> > >>> it's tested on Intel as well.)
> > >>>
> > >>> Thanks,
> > >>> Usama
> > >>>
> > >>
> > >> Hi,
> > >>
> > >> So I have been doing some benchmarking after Kanchana pointed out a performance
> > >> regression in [1] of swapping in large folios. I would love to get thoughts from
> > >> zram folks on this, as that's where large folio swapin was first added [2].
> > >> As far as I can see, the current support in zram is doing large folio swapin
> > >> at 4K granularity. The large granularity compression in [3] which was posted
> > >> in March is not merged, so I am currently comparing upstream zram with this series.
> > >>
> > >> With the microbenchmark below of timing 1G swapin, there was a very large improvement
> > >> in performance by using this series. I think similar numbers would be seen in zram.
> > >
> > > Imagine running several apps on a phone and switching
> > > between them: A → B → C → D → E … → A → B … The app
> > > currently on the screen retains its memory, while the ones
> > > sent to the background are swapped out. When we bring
> > > those apps back to the foreground, their memory is restored.
> > > This behavior is quite similar to what you're seeing with
> > > your microbenchmark.
> > >
> >
> > Hi Barry,
> >
> > Thanks for explaining this! Do you know if there is some open-source benchmark
> > we could use to show an improvement in app switching with large folios?
> >
>
> I'm fairly certain the Android team has this benchmark, but it's not
> open source.
>
> A straightforward way to simulate this is to use a script that
> cyclically launches multiple applications, such as Chrome, Firefox,
> Office, PDF, and others.
>
> For example:
>
> launch chrome;
> launch firefox;
> launch youtube;
> ....
> launch chrome;
> launch firefox;
> ....
>
> On Android, we have the activity manager's "am" command to do that.
> https://gist.github.com/tsohr/5711945
>
> Not quite sure if other window managers have similar tools.
>
> > Also I guess swap thrashing can happen when apps are brought back to foreground?
> >
>
> Typically, the foreground app doesn't experience much swapping,
> as it is the most recently or frequently used. However, this may
> not hold for very low-end phones, where memory is significantly
> less than the app's working set. For instance, we can't expect a
> good user experience when playing a large game that requires 8GB
> of memory on a 4GB phone! :-)
> And for low-end phones, we never even enable mTHP.
>
> > >>
> > >> But when doing the kernel build test, Kanchana saw a regression in [1]. I believe
> > >> it's because of swap thrashing (causing large zswap activity), due to larger page swapin.
> > >> The part of the code that decides large folio swapin is the same between zswap and zram,
> > >> so I believe this would be observed in zram as well.
> > >
> > > Is this an extreme case where the workload's working set far
> > > exceeds the available memory by memcg limitation? I doubt mTHP
> > > would provide any real benefit from the start if the workload is bound to
> > > experience swap thrashing. What if we disable mTHP entirely?
> > >
> >
> > I would agree, this is an extreme case. I wanted (z)swap activity to happen, so I limited
> > memory.max to 4G.
> >
> > mTHP is beneficial in kernel test benchmarking going from no mTHP to 16K:
> >
> > ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G
> > metric          no mTHP         16K mTHP=always
> > real            1m0.613s        0m52.008s
> > user            25m23.028s      25m19.488s
> > sys             25m45.466s      18m11.640s
> > zswpin          1911194         3108438
> > zswpout         6880815         9374628
> > pgfault         120430166       48976658
> > pgmajfault      1580674         2327086
> >
>
> Interesting! We never use a phone to build the Linux kernel, but
> let me see if I can find some other machines to reproduce your data.

Hi Usama,

I suspect the regression occurs because you're running an edge
case where the memory cgroup stays nearly full most of the time
(this isn't an inherent issue with large folio swap-in). As a result,
swapping in mTHP quickly triggers a memcg overflow, causing a
swap-out. The next swap-in then recreates the overflow, leading
to a repeating cycle.
We need a way to stop the cup from repeatedly filling to the brim
and overflowing. While not a definitive fix, the following change
might help improve the situation:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17af08367c68..f2fa0eeb2d9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 	memcg = get_mem_cgroup_from_mm(mm);
 	rcu_read_unlock();
 
-	ret = charge_memcg(folio, memcg, gfp);
+	if (folio_test_large(folio) && mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH)
+		ret = -ENOMEM;
+	else
+		ret = charge_memcg(folio, memcg, gfp);
 
 	css_put(&memcg->css);
 	return ret;
 }

Please confirm if it makes the kernel build with the memcg limitation
faster. If so, let's work together to figure out an official patch :-)
The above code hasn't considered the parent memcg's overflow, so it
is not an ideal fix.

> >
> >
> > >
> > >>
> > >> My initial thought was this might be because it's Intel, where you don't have the advantage
> > >> of TLB coalescing, so I tested on AMD and ARM, but the regression is there on AMD
> > >> and ARM as well, though a bit less (have added the numbers below).
> > >>
> > >> The numbers show that the zswap activity increases and page faults decrease.
> > >> Overall this does result in sys time increasing and real time slightly increasing,
> > >> likely because the cost of increased zswap activity is more than the benefit of
> > >> lower page faults.
> > >> I can see in [3] that page faults reduced in zram as well.
> > >>
> > >> Large folio swapin shows good numbers in microbenchmarks that just target reduced page
> > >> faults and sequential swapin only, but not in the kernel build test. Is a similar regression
> > >> observed with zram when enabling large folio swapin on the kernel build test? Maybe large
> > >> folio swapin makes more sense on workloads where mappings are kept for a longer time?
> > >>
> > >
> > > I suspect this is because mTHP doesn't always benefit workloads
> > > when available memory is quite limited compared to the working set.
> > > In that case, mTHP swap-in might introduce more features that
> > > exacerbate the problem. We used to have an extra control "swapin_enabled"
> > > for swap-in, but it never gained much traction:
> > > https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/
> > > We can reconsider whether to include the knob, but if it's better
> > > to disable mTHP entirely for these cases, we can still adhere to
> > > the policy of "enabled".
> > >
> > Yes, I think this makes sense to have. The only thing is, it's too many knobs!
> > I personally think it's already difficult to decide up to which mTHP size we
> > should enable (and I think this changes per workload). But if we add swapin_enabled
> > on top of that, it can make things more difficult.
> >
> > > Using large block compression and decompression in zRAM will
> > > significantly reduce CPU usage, likely making the issue unnoticeable.
> > > However, the default minimum size for large block support is currently
> > > set to 64KB (ZSMALLOC_MULTI_PAGES_ORDER = 4).
> > >
> >
> > I saw that the patch was sent in March, and there weren't any updates after?
> > Maybe I can try and cherry-pick that and see if we can develop large
> > granularity compression for zswap.
>
> will provide an updated version next week.
> >
> > >>
> > >> Kernel build numbers in cgroup with memory.max=4G to trigger zswap
> > >> Command for AMD: make defconfig; time make -j$(nproc) bzImage
> > >> Command for ARM: make defconfig; time make -j$(nproc) Image
> > >>
> > >>
> > >> AMD 16K+32K THP=always
> > >> metric         mm-unstable      mm-unstable + large folio zswapin series
> > >> real           1m23.038s        1m23.050s
> > >> user           53m57.210s       53m53.437s
> > >> sys            7m24.592s        7m48.843s
> > >> zswpin         612070           999244
> > >> zswpout        2226403          2347979
> > >> pgfault        20667366         20481728
> > >> pgmajfault     385887           269117
> > >>
> > >> AMD 16K+32K+64K THP=always
> > >> metric         mm-unstable      mm-unstable + large folio zswapin series
> > >> real           1m22.975s        1m23.266s
> > >> user           53m51.302s       53m51.069s
> > >> sys            7m40.168s        7m57.104s
> > >> zswpin         676492           1258573
> > >> zswpout        2449839          2714767
> > >> pgfault        17540746         17296555
> > >> pgmajfault     429629           307495
> > >> --------------------------
> > >> ARM 16K+32K THP=always
> > >> metric         mm-unstable      mm-unstable + large folio zswapin series
> > >> real           0m51.168s        0m52.086s
> > >> user           25m14.715s       25m15.765s
> > >> sys            17m18.856s       18m8.031s
> > >> zswpin         3904129          7339245
> > >> zswpout        11171295         13473461
> > >> pgfault        37313345         36011338
> > >> pgmajfault     2726253          1932642
> > >>
> > >> ARM 16K+32K+64K THP=always
> > >> metric         mm-unstable      mm-unstable + large folio zswapin series
> > >> real           0m52.017s        0m53.828s
> > >> user           25m2.742s        25m0.046s
> > >> sys            18m24.525s       20m26.207s
> > >> zswpin         4853571          8908664
> > >> zswpout        12297199         15768764
> > >> pgfault        32158152         30425519
> > >> pgmajfault     3320717          2237015
> > >>
> > >> Thanks!
> > >> Usama
> > >>
> > >>
> > >> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
> > >> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/
> > >> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
> > >>
> > >>>
> > >>>>>
> > >>>>> The time measured was pretty consistent between runs (~1-2% variation).
> > >>>>> There is a 36% improvement in zswapin time with 1M folios. The percentage
> > >>>>> improvement is likely to be more if the memcmp is removed.
> > >>>>>
> > >>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
> > >>>>> index 40de679248b8..77068c577c86 100644
> > >>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c
> > >>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c
> > >>>>> @@ -9,6 +9,8 @@
> > >>>>>  #include
> > >>>>>  #include
> > >>>>>  #include
> > >>>>> +#include
> > >>>>> +#include
> > >>>>>
> > >>>>>  #include "../kselftest.h"
> > >>>>>  #include "cgroup_util.h"
> > >>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root)
> > >>>>>         return test_zswap_writeback(root, false);
> > >>>>>  }
> > >>>>>
> > >>>>> +static int zswapin_perf(const char *cgroup, void *arg)
> > >>>>> +{
> > >>>>> +       long pagesize = sysconf(_SC_PAGESIZE);
> > >>>>> +       size_t memsize = MB(1*1024);
> > >>>>> +       char buf[pagesize];
> > >>>>> +       int ret = -1;
> > >>>>> +       char *mem;
> > >>>>> +       struct timeval start, end;
> > >>>>> +
> > >>>>> +       mem = (char *)memalign(2*1024*1024, memsize);
> > >>>>> +       if (!mem)
> > >>>>> +               return ret;
> > >>>>> +
> > >>>>> +       /*
> > >>>>> +        * Fill half of each page with increasing data, and keep other
> > >>>>> +        * half empty, this will result in data that is still compressible
> > >>>>> +        * and ends up in zswap, with material zswap usage.
> > >>>>> +        */
> > >>>>> +       for (int i = 0; i < pagesize; i++)
> > >>>>> +               buf[i] = i < pagesize/2 ? (char) i : 0;
> > >>>>> +
> > >>>>> +       for (int i = 0; i < memsize; i += pagesize)
> > >>>>> +               memcpy(&mem[i], buf, pagesize);
> > >>>>> +
> > >>>>> +       /* Try and reclaim allocated memory */
> > >>>>> +       if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
> > >>>>> +               ksft_print_msg("Failed to reclaim all of the requested memory\n");
> > >>>>> +               goto out;
> > >>>>> +       }
> > >>>>> +
> > >>>>> +       gettimeofday(&start, NULL);
> > >>>>> +       /* zswpin */
> > >>>>> +       for (int i = 0; i < memsize; i += pagesize) {
> > >>>>> +               if (memcmp(&mem[i], buf, pagesize)) {
> > >>>>> +                       ksft_print_msg("invalid memory\n");
> > >>>>> +                       goto out;
> > >>>>> +               }
> > >>>>> +       }
> > >>>>> +       gettimeofday(&end, NULL);
> > >>>>> +       printf("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000);
> > >>>>> +       ret = 0;
> > >>>>> +out:
> > >>>>> +       free(mem);
> > >>>>> +       return ret;
> > >>>>> +}
> > >>>>> +
> > >>>>> +static int test_zswapin_perf(const char *root)
> > >>>>> +{
> > >>>>> +       int ret = KSFT_FAIL;
> > >>>>> +       char *test_group;
> > >>>>> +
> > >>>>> +       test_group = cg_name(root, "zswapin_perf_test");
> > >>>>> +       if (!test_group)
> > >>>>> +               goto out;
> > >>>>> +       if (cg_create(test_group))
> > >>>>> +               goto out;
> > >>>>> +
> > >>>>> +       if (cg_run(test_group, zswapin_perf, NULL))
> > >>>>> +               goto out;
> > >>>>> +
> > >>>>> +       ret = KSFT_PASS;
> > >>>>> +out:
> > >>>>> +       cg_destroy(test_group);
> > >>>>> +       free(test_group);
> > >>>>> +       return ret;
> > >>>>> +}
> > >>>>> +
> > >>>>>  /*
> > >>>>>   * When trying to store a memcg page in zswap, if the memcg hits its memory
> > >>>>>   * limit in zswap, writeback should affect only the zswapped pages of that
> > >>>>> @@ -584,6 +654,7 @@ struct zswap_test {
> > >>>>>         T(test_zswapin),
> > >>>>>         T(test_zswap_writeback_enabled),
> > >>>>>         T(test_zswap_writeback_disabled),
> > >>>>> +       T(test_zswapin_perf),
> > >>>>>         T(test_no_kmem_bypass),
> > >>>>>         T(test_no_invasive_cgroup_shrink),
> > >>>>>  };
> > >>>>>
> > >>>>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/
> > >>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/
> > >>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u
> > >>>>> [4] https://lwn.net/Articles/955575/
> > >>>>>
> > >>>>> Usama Arif (4):
> > >>>>>   mm/zswap: skip swapcache for swapping in zswap pages
> > >>>>>   mm/zswap: modify zswap_decompress to accept page instead of folio
> > >>>>>   mm/zswap: add support for large folio zswapin
> > >>>>>   mm/zswap: count successful large folio zswap loads
> > >>>>>
> > >>>>>  Documentation/admin-guide/mm/transhuge.rst |   3 +
> > >>>>>  include/linux/huge_mm.h                    |   1 +
> > >>>>>  include/linux/zswap.h                      |   6 ++
> > >>>>>  mm/huge_memory.c                           |   3 +
> > >>>>>  mm/memory.c                                |  16 +--
> > >>>>>  mm/page_io.c                               |   2 +-
> > >>>>>  mm/zswap.c                                 | 120 ++++++++++++++-------
> > >>>>>  7 files changed, 99 insertions(+), 52 deletions(-)
> > >>>>>
> > >>>>> --
> > >>>>> 2.43.5
> > >>>>>
> > >>>>
> >

Thanks
Barry
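
The thrashing pattern debated in this thread (zswpin roughly doubling while major faults barely drop, as in the kernel-build tables above) can be sketched with a toy LRU model. This is an illustrative simulation only, not kernel code: the `simulate` function, the page counts, and the uniform random access pattern are all invented assumptions, and real memcg reclaim behaves very differently.

```python
# Toy model: random touches over a working set larger than the "memcg"
# limit, comparing per-page (4K-like) swap-in against folio-sized
# (mTHP-like) swap-in that charges a whole aligned chunk per fault.
from collections import OrderedDict
import random

def simulate(limit, hot, folio, accesses, seed=7):
    """Return (major_faults, pages_swapped_in). `limit`, `hot`, and
    `folio` are page counts; all values here are made up."""
    rng = random.Random(seed)
    lru = OrderedDict()              # resident pages, coldest first
    major = swapin = 0
    for _ in range(accesses):
        p = rng.randrange(hot)
        if p in lru:
            lru.move_to_end(p)       # hit: refresh recency
            continue
        major += 1                   # miss: one major fault...
        base = (p // folio) * folio
        for q in range(base, base + folio):
            if q not in lru:
                swapin += 1          # ...but `folio` pages of swap-in work
            lru[q] = True
            lru.move_to_end(q)
            while len(lru) > limit:
                lru.popitem(last=False)  # evict coldest page (swap-out)
    return major, swapin

small = simulate(limit=512, hot=2048, folio=1, accesses=5000)
large = simulate(limit=512, hot=2048, folio=256, accesses=5000)
print("4K swap-in: faults=%d swapin=%d" % small)
print("1M swap-in: faults=%d swapin=%d" % large)
```

Under this model the fault count stays in the same ballpark for both folio sizes, but the folio-sized swap-in moves far more pages through (z)swap, which is the shape of the sys-time regression reported above; with sequential access instead of random, the large-folio case wins, matching the 1G microbenchmark.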