From mboxrd@z Thu Jan 1 00:00:00 1970
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com>
 <20240307140344.4wlumk6zxustylh6@quack3>
 <8da6a093-346b-35cd-818a-a82abfa6a930@oppo.com>
 <20240314082651.ckfpp2tyslq2hl2c@quack3>
From: Chris Li
Date: Wed, 15 May 2024 16:07:13 -0700
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chuanhua Han
Cc: Jan Kara, Chuanhua Han, linux-mm, lsf-pc@lists.linux-foundation.org,
 ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com, Matthew Wilcox
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Hi,

Here is my slide for today's swap abstraction discussion.
https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view

Chris

On Thu, Mar 14, 2024 at 4:20 AM Chuanhua Han wrote:
>
> On Thu, Mar 14, 2024 at 16:28, Jan Kara wrote:
> >
> > On Fri 08-03-24 10:02:20, Chuanhua Han wrote:
> > >
> > > On 2024/3/7 22:03, Jan Kara wrote:
> > > > On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> > > >> On 2024/3/1 17:24, Chris Li wrote:
> > > >>> In last year's LSF/MM I talked about a VFS-like swap system. That is
> > > >>> the pony that was chosen.
> > > >>> However, I did not have much chance to go into details.
> > > >>>
> > > >>> This year, I would like to discuss what it would take to re-architect
> > > >>> the whole swap back end from scratch.
> > > >>>
> > > >>> Let's start from the requirements for the swap back end.
> > > >>>
> > > >>> 1) Support the existing swap usage (not the implementation).
> > > >>>
> > > >>> Some other design goals:
> > > >>>
> > > >>> 2) Low per-swap-entry memory usage.
> > > >>>
> > > >>> 3) Low IO latency.
> > > >>>
> > > >>> What are the functions the swap system needs to support?
> > > >>>
> > > >>> At the device level, the swap system needs to support a list of swap
> > > >>> files with a priority order. Swap devices of the same priority are
> > > >>> written in round-robin order. Swap device types include zswap, zram,
> > > >>> SSD, spinning hard disk, and a swap file in a file system.
> > > >>>
> > > >>> At the swap entry level, here is the list of existing swap entry usage:
> > > >>>
> > > >>> * Swap entry allocation and free. Each swap entry needs to be
> > > >>> associated with a location of the disk space in the swapfile (the
> > > >>> offset of the swap entry).
> > > >>> * Each swap entry needs to track the map count of the entry. (swap_map)
> > > >>> * Each swap entry needs to be able to find the associated memory
> > > >>> cgroup.
> > > >>> (swap_cgroup_ctrl->map)
> > > >>> * Swap cache: look up a folio/shadow from a swap entry.
> > > >>> * Swap page writes through a swapfile in a file system other than a
> > > >>> block device. (swap_extent)
> > > >>> * Shadow entries. (stored in the swap cache)
> > > >>>
> > > >>> Any new swap back end might have a different internal implementation,
> > > >>> but it needs to support the above usage. For example, using an
> > > >>> existing file system as the swap backend, with a per-vma or
> > > >>> per-swap-entry mapping to a file, would need an additional data
> > > >>> structure to track the swap_cgroup_ctrl, on top of the size of the
> > > >>> file inode. It would be challenging to meet design goals 2) and 3)
> > > >>> using another file system as-is.
> > > >>>
> > > >>> I am considering grouping the different swap entry data into one
> > > >>> single struct and allocating it dynamically, so there is no upfront
> > > >>> allocation of swap_map.
> > > >>>
> > > >>> For swap entry allocation: the current kernel supports swapping out
> > > >>> 0-order or PMD-order pages.
> > > >>>
> > > >>> There are discussions and patches that add swap out for folio sizes
> > > >>> in between (mTHP):
> > > >>>
> > > >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> > > >>>
> > > >>> and swap in for mTHP:
> > > >>>
> > > >>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
> > > >>>
> > > >>> The introduction of swapping different orders of pages will further
> > > >>> complicate the swap entry fragmentation issue. The swap back end has
> > > >>> no way to predict the life cycle of the swap entries. Repeatedly
> > > >>> allocating and freeing swap entries of different sizes will fragment
> > > >>> the swap entry array. If we can't allocate contiguous swap entries
> > > >>> for an mTHP, it will have to be split to a smaller size to perform
> > > >>> the swap in and out.
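To make the grouping idea above concrete, here is a minimal userspace sketch. All names here are hypothetical, not existing kernel symbols: one lazily allocated descriptor per in-use swap entry carries the state that today lives in separate parallel structures (a swap_map slot, swap_cgroup_ctrl->map, and the swap-cache slot), so nothing is allocated up front for unused entries.

```c
#include <stdlib.h>

/* Hypothetical descriptor grouping per-entry swap state. Illustrative
 * only; the real kernel keeps these in separate structures. */
struct swap_desc {
    unsigned char  map_count;        /* role of a swap_map[] slot */
    unsigned short memcg_id;         /* role of swap_cgroup_ctrl->map */
    void          *shadow_or_folio;  /* swap-cache slot: folio or shadow */
};

#define SKETCH_SLOTS 16              /* tiny stand-in for maxpages */

static struct swap_desc *slots[SKETCH_SLOTS];

/* Allocate the descriptor lazily on first use: low per-entry memory
 * (design goal 2) when most of the device is unused. */
static struct swap_desc *swap_desc_get(unsigned long off)
{
    if (!slots[off])
        slots[off] = calloc(1, sizeof(struct swap_desc));
    return slots[off];
}

static void swap_desc_free(unsigned long off)
{
    free(slots[off]);
    slots[off] = NULL;
}
```

The trade-off versus the current upfront swap_map array is one pointer per slot plus an allocation on first swap-out of each entry, in exchange for keeping all per-entry state behind a single lookup.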
> > > >>>
> > > >>> Current swap only supports 4K pages or PMD-size pages. Adding the
> > > >>> other in-between sizes greatly increases the chance of fragmenting
> > > >>> the swap entry space. When there are no more contiguous swap entries
> > > >>> for an mTHP, it forces the mTHP to split into 4K pages. If we don't
> > > >>> solve the fragmentation issue, it will be a constant source of mTHP
> > > >>> splits.
> > > >>>
> > > >>> Another limitation I would like to address is that swap_writepage can
> > > >>> only write out IO in one contiguous chunk; it is not able to perform
> > > >>> non-contiguous IO. When the swapfile is close to full, the unused
> > > >>> entries are likely spread across different locations. It would be
> > > >>> nice to be able to read and write a large folio using discontiguous
> > > >>> disk IO locations.
> > > >>>
> > > >>> Some possible ideas for the fragmentation issue:
> > > >>>
> > > >>> a) A buddy allocator for swap entries, similar to the buddy allocator
> > > >>> for memory. We can use a buddy allocator system for swap entries to
> > > >>> keep low-order swap entries from fragmenting too much of the
> > > >>> high-order swap entry space. It should greatly reduce the
> > > >>> fragmentation caused by allocating and freeing swap entries of
> > > >>> different sizes. However, the buddy allocator has its own limits as
> > > >>> well. Unlike system memory, which we can move and compact, there is
> > > >>> no rmap for a swap entry, so it is much harder to move a swap entry
> > > >>> to another disk location. A buddy allocator for swap will help, but
> > > >>> it will not solve all the fragmentation issues.
> > > >> I have an idea here 😁
> > > >>
> > > >> Each swap device is divided into multiple chunks, and each chunk is
> > > >> dedicated to one allocation order (the order of the swapped-out
> > > >> folio; each chunk serves only that one order).
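Idea a) can be sketched as a tiny userspace model (a per-order free bitmap stands in for the kernel's free lists; `__builtin_ctz` is a GCC/Clang builtin; nothing here is an existing kernel interface). Allocation splits a larger free block down to the requested order; freeing re-merges a block with its buddy, which is what keeps low-order churn from permanently eating the high orders:

```c
#include <string.h>

#define MAX_ORDER 4            /* orders 0..4 -> blocks of 1..16 slots */

/* free_map[o] is a bitmask over blocks of order o: bit b set means the
 * block starting at slot offset b << o is free. Toy-sized on purpose. */
static unsigned free_map[MAX_ORDER + 1];

static void buddy_init(void)
{
    memset(free_map, 0, sizeof(free_map));
    free_map[MAX_ORDER] = 1;   /* one free block covering all slots */
}

/* Allocate 2^order contiguous slots; returns slot offset, or -1. */
static int buddy_alloc(int order)
{
    int o = order;

    while (o <= MAX_ORDER && !free_map[o])
        o++;                                   /* find a big enough block */
    if (o > MAX_ORDER)
        return -1;
    int b = __builtin_ctz(free_map[o]);        /* lowest free block */
    free_map[o] &= ~(1u << b);
    while (o > order) {                        /* split, freeing upper halves */
        o--;
        b <<= 1;
        free_map[o] |= 1u << (b + 1);
    }
    return b << order;
}

/* Free 2^order slots at offset, merging with free buddies as we go. */
static void buddy_free(int offset, int order)
{
    int b = offset >> order;

    while (order < MAX_ORDER && (free_map[order] & (1u << (b ^ 1)))) {
        free_map[order] &= ~(1u << (b ^ 1));   /* absorb the free buddy */
        b >>= 1;
        order++;
    }
    free_map[order] |= 1u << b;
}
```

As the thread notes, this only reduces fragmentation: once mixed-order blocks are pinned in place, no merge can happen, because there is no rmap to relocate a live swap entry.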
> > > >> This can solve the fragmentation problem, is much simpler than
> > > >> buddy and easier to implement, and can be compatible with multiple
> > > >> sizes, similar to a small slab allocator.
> > > >>
> > > >> 1) Add structure members
> > > >> In swap_info_struct, we only need to add an offset array representing
> > > >> the search offset for each order, e.g.:
> > > >>
> > > >> #define MTHP_NR_ORDER 9
> > > >>
> > > >> struct swap_info_struct {
> > > >>         ...
> > > >>         long order_off[MTHP_NR_ORDER];
> > > >>         ...
> > > >> };
> > > >>
> > > >> Note: order_off = -1 indicates that the order is not supported.
> > > >>
> > > >> 2) Initialize
> > > >> Set the proportion of the swap device occupied by each order. For
> > > >> simplicity, with 8 kinds of orders, the number of slots occupied by
> > > >> each order is chunk_size = maxpages / 8 (where maxpages is the
> > > >> maximum number of available slots in the current swap device).
> > > > Well, but then if you fill the space of a particular order and need
> > > > to swap out a page of that order, what do you do? Return ENOSPC
> > > > prematurely?
> > > If we swap out a subpage of a large folio (due to a split of the large
> > > folio), we simply search for a free swap entry from order_off[0].
> >
> > I meant: what are you going to do if you want to swap out a 2MB huge
> > page but you don't have any free swap entry of the appropriate order?
> > History shows that these schemes, where you partition the available
> > space into buckets of pages of different orders, tend to fragment
> > rather quickly, so you also need to implement some defragmentation /
> > compaction scheme, and once you do that you are at the complexity of a
> > standard filesystem block allocator. That is all I wanted to point
> > at :)
> OK, got it!
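For what it's worth, the partitioning scheme above and the premature ENOSPC Jan describes can be modeled in a few lines. This is a toy userspace sketch under stated assumptions: next_off[] plays the role of the proposed order_off[], allocation is a simple bump pointer within each region, and none of these names exist in the kernel. Each order allocates only inside its own fixed region, so one order's region can run dry while the others still have plenty of free slots:

```c
#define NR_ORDERS   3          /* orders 0..2 for the sketch */
#define CHUNK_SLOTS 8          /* slots reserved per order's region */

struct order_chunks {
    long next_off[NR_ORDERS];  /* next free slot per region; -1 = unsupported */
};

static void chunks_init(struct order_chunks *c)
{
    for (int o = 0; o < NR_ORDERS; o++)
        c->next_off[o] = (long)o * CHUNK_SLOTS;   /* region base */
}

/* Bump-allocate 2^order slots from that order's region. Returns the
 * slot offset, or -1 (the premature ENOSPC) once the region is
 * exhausted, even if the other regions still have room. */
static long chunk_alloc(struct order_chunks *c, int order)
{
    long base = (long)order * CHUNK_SLOTS;
    long off = c->next_off[order];

    if (off < 0 || off + (1L << order) > base + CHUNK_SLOTS)
        return -1;
    c->next_off[order] = off + (1L << order);
    return off;
}
```

With CHUNK_SLOTS = 8, the order-1 region serves exactly four 2-slot allocations and then fails, while the order-0 and order-2 regions are still completely empty; that is the imbalance a defragmentation/compaction scheme would have to fix.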
> It's true that my approach doesn't eliminate fragmentation, but
> fragmentation can be mitigated to some extent, and the method itself
> doesn't involve complex file system operations.
> >
> >                                                                 Honza
> > --
> > Jan Kara
> > SUSE Labs, CR
>
> Thanks,
> Chuanhua