From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 10 Jan 2025 23:28:43 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
To: Nhat Pham
Cc: Usama Arif, lsf-pc@lists.linux-foundation.org,
 Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt

On Fri, Jan 10, 2025 at 5:29 PM Nhat Pham wrote:
>
> On Fri, Jan 10, 2025 at 3:08 AM Usama Arif wrote:
> >
> > I would like to propose a session to discuss the work going on
> > around large folio swapin, whether it's traditional swap, zswap,
> > or zram.
>
> I'm interested! Count me in the discussion :)
>
> >
> > Large folios have obvious advantages that have been discussed before,
> > like fewer page faults, batched PTE and rmap manipulation, shorter
> > LRU lists, and TLB coalescing (on arm64 and AMD).
> > However, swapping in large folios has its own drawbacks, like higher
> > swap thrashing.
> > I had initially sent an RFC for zswapin of large folios in [1],
> > but it causes a regression in kernel build time due to swap
> > thrashing, which I am confident is happening with zram large
> > folio swapin as well (which is merged in the kernel).
> >
> > Some of the points we could discuss in the session:
> >
> > - What is the right (preferably open source) benchmark to test
> > swapin of large folios? Kernel build time in a limited-memory
> > cgroup shows a regression, while microbenchmarks show a massive
> > improvement; maybe there are benchmarks where TLB misses are
> > a big factor and would show an improvement.
> >
> > - We could have something like
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> > to enable/disable swapin, but it's going to be difficult to tune,
> > might have different optimum values based on workloads, and is
> > likely to be
>
> Might even be different across memory regions.
>
> > left at the default values. Is there some dynamic way to decide
> > when to swap in large folios and when to fall back to smaller
> > folios? The swapin_readahead swapcache path, which only supports
> > 4K folios atm, has a readahead window based on hits; however,
> > readahead is a folio flag, not a page flag, so this method can't
> > be used: once a large folio is swapped in, we won't get a fault,
> > and subsequent hits on the other pages of the large folio won't
> > be recorded.
>
> Is this beneficial/useful enough to make it into a page flag?
>
> Can we push this to the swap layer, i.e. record the hit information
> on a per-swap-entry basis instead? The space is a bit tight, but
> we're already discussing the new swap abstraction layer. If we go
> the dynamic route, we can squeeze this kind of information into the
> dynamically allocated per-swap-entry metadata structure (swap
> descriptor?).
>
> However, the swap entry can go away after a swapin (see
> should_try_to_free_swap()), so that might be busted :)
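
As a very rough sketch of what that per-swap-entry route could look
like (all names below are hypothetical, none of this is existing
kernel API, and it ignores the should_try_to_free_swap() problem you
mention):

#include <stdint.h>

/* Hypothetical stand-ins for kernel types, only so the sketch is
 * self-contained; the real thing would live in the swap layer. */
typedef uint64_t swp_entry_t;

struct swap_desc {
	swp_entry_t entry;   /* the slot this descriptor covers */
	uint8_t hits;        /* saturating hit count for this entry */
};

/* On a "minor hit" -- a fault finds the entry still in swap (or in
 * the swapcache) -- remember that it was touched. */
static void swap_desc_record_hit(struct swap_desc *desc)
{
	if (desc->hits < UINT8_MAX)
		desc->hits++;
}

/* At fault time: a hot history suggests trying a large folio, a cold
 * one falls back to a 4K swapin (the threshold here is made up). */
static int swapin_suggest_order(const struct swap_desc *desc, int max_order)
{
	return desc->hits >= 4 ? max_order : 0;
}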
>
> >
> > - For zswap and zram, it might be that larger block compression/
> > decompression offsets the regression from swap thrashing, but it
> > brings its own issues. For example, once a large folio is swapped
> > out, it could fail to swap in as a large folio and fall back to
> > 4K, resulting in redundant decompressions.
> > This will also mean swapin of large folios from traditional swap
> > isn't something we should proceed with?
>
> Yeah, the cost/benefit analysis differs between backends. I wonder
> if a one-size-fits-all, backend-agnostic policy could ever work -
> maybe we need some backend-driven algorithm, or some sort of hinting
> mechanism?
>
> This would make the logic uglier, though. We've been here before with
> HDD and SSD swap, except we don't really care about the former, so we
> can prioritize optimizing for SSD swap (in fact, it looks like we're
> removing the HDD portion of the swap allocator). In this case,
> however, zswap, zram, and SSD swap are all valid options, with
> different characteristics that can make the optimal decision differ :)
>
> If we're going the block (de)compression route, there is also this
> pesky block-size question. For instance, do we want to store the
> entire 2MB in a single block? That would mean we need to decompress
> the entire 2MB block at load time. It might be more straightforward
> in the mTHP world, but we do need to consider 2MB THP users too.

I don't think we need to save the entire 2MB in a single block. Beyond
64KB, we don't see much improvement in compression ratio or speed; the
most significant gain is between 4KB and 16KB. For example, for zstd
on a 182502912-byte file:

Block size   Compression   Decompression   Compressed size   Ratio
4KB          0.967303 s    0.200064 s      66089193 bytes    36.21%
16KB         0.567167 s    0.152807 s      59159073 bytes    32.42%
32KB         0.543887 s    0.136602 s      57958701 bytes    31.76%
64KB         0.536979 s    0.127069 s      56700795 bytes    31.07%
128KB        0.540505 s    0.120685 s      55765775 bytes    30.56%
256KB        0.575515 s    0.125049 s      54203461 bytes    29.70%
512KB        0.571370 s    0.119609 s      53914422 bytes    29.54%
1024KB       0.556631 s    0.119475 s      53239893 bytes    29.17%
2048KB       0.539796 s    0.119751 s      52923234 bytes    29.00%

To simplify things (and reduce the potential decompression of large
blocks for small swap-ins), for a 2MB THP we actually save it as
2MB/16KB = 128 independent 16KB blocks in zsmalloc, as shown in the
RFC:

https://lore.kernel.org/linux-mm/20241121222521.83458-1-21cnbao@gmail.com/
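
In userspace terms, the chunking amounts to something like the toy
below (a standalone libzstd sketch, not the RFC code itself); the
point is that a later 4KB swap-in only has to decompress the single
16KB block covering the faulting page:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>	/* build with: gcc blocks.c -lzstd */

#define FOLIO_SIZE (2UL << 20)               /* one 2MB THP */
#define BLOCK_SIZE (16UL << 10)              /* 16KB compression unit */
#define NR_BLOCKS  (FOLIO_SIZE / BLOCK_SIZE) /* = 128 blocks */

int main(void)
{
	char *folio = malloc(FOLIO_SIZE);
	size_t bound = ZSTD_compressBound(BLOCK_SIZE);
	char *out = malloc(bound);
	size_t total = 0;

	if (!folio || !out)
		return 1;
	memset(folio, 'x', FOLIO_SIZE);      /* stand-in page contents */

	for (unsigned long i = 0; i < NR_BLOCKS; i++) {
		/* each 16KB block compresses independently, so any
		 * single block can be decompressed on its own later */
		size_t csize = ZSTD_compress(out, bound,
					     folio + i * BLOCK_SIZE,
					     BLOCK_SIZE, 3);
		if (ZSTD_isError(csize))
			return 1;
		total += csize;              /* would go to zsmalloc */
	}
	printf("%lu blocks, %zu bytes compressed\n",
	       (unsigned long)NR_BLOCKS, total);
	free(out);
	free(folio);
	return 0;
}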
> Finally, the calculus might change once large folio allocation
> becomes more reliable. Perhaps we can wait until Johannes and Yu
> make this work?
>
> >
> > - Should we even support large folio swapin? You often have high
> > swap activity when the system/cgroup is close to running out of
> > memory; at that point, maybe the best way forward is to just swap
> > in 4K pages and let khugepaged [2], [3] collapse them if the
> > surrounding pages are swapped in as well.
>
> Perhaps this is the easiest thing to do :)

Thanks
Barry