References: <20241116091658.1983491-1-chenridong@huaweicloud.com> <20241116091658.1983491-2-chenridong@huaweicloud.com>
From: Chris Li
Date: Tue, 26 Nov 2024 16:08:56 -0800
Subject: Re: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the tail of LRU after shrinking
To: Matthew Wilcox
Cc: Barry Song <21cnbao@gmail.com>, Chen Ridong, akpm@linux-foundation.org, mhocko@suse.com, hannes@cmpxchg.org, yosryahmed@google.com, yuzhao@google.com, david@redhat.com, ryan.roberts@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, chenridong@huawei.com, wangweiyang2@huawei.com, xieym_ict@hotmail.com

On Sun, Nov 17, 2024 at 8:22 PM Matthew Wilcox wrote:
>
> On Mon, Nov 18, 2024 at 05:14:14PM +1300, Barry Song wrote:
> > On Mon, Nov 18, 2024 at 5:03 PM Matthew Wilcox wrote:
> > >
> > > On Sat, Nov 16, 2024 at 09:16:58AM +0000, Chen Ridong wrote:
> > > > 2. In the shrink_page_list function, if folioN is a THP (2M), it may
> > > >    be split and added to the swap cache folio by folio. After being
> > > >    added to the swap cache, IO is submitted to write the folio back
> > > >    to swap, which is asynchronous. When shrink_page_list is finished,
> > > >    the isolated folios list will be moved back to the head of the
> > > >    inactive LRU. The inactive LRU may then look like this, with 512
> > > >    folios having been moved to the head of the inactive LRU.
> > >
> > > I was hoping that we'd be able to stop splitting the folio when adding
> > > to the swap cache. Ideally, we'd add the whole 2MB and write it back
> > > as a single unit.
> >
> > This is already the case: adding to the swapcache doesn't require
> > splitting THPs, but failing to allocate 2MB of contiguous swap slots
> > will.
>
> Agreed we need to understand why this is happening. As I've said a few
> times now, we need to stop requiring contiguity. Real filesystems don't
> need the contiguity (they become less efficient, but they can scatter a
> single 2MB folio to multiple places).
>
> Maybe Chris has a solution to this in the works?

Hi Matthew and Chenridong,

Sorry for the late reply. I don't have a working solution yet; I just
have some ideas.

One of the big challenges is what to do with the swap cache. Currently,
when a folio is added to the swap cache, it is assumed to occupy a
contiguous range of swap entries. There would be a lot of complexity in
breaking that assumption. To make things worse, the discontiguous swap
entries might belong to different xarrays due to the 64M swap address
space sharding.

One idea is to have a special kind of swap device that does swap entry
redirecting.

For the swap-out path, let's say the real swapfile A is almost full and
we want to allocate a run of 4 swap entries for folio F. If there are
contiguous swap entries in A, the swap allocator just returns the
entries [A9..A12], with A9 as the head swap entry. That is the same as
the normal path we have now.

On the other hand, suppose there are no contiguous swap entries in A,
only the non-contiguous entries A1, A3, A5, A7. Instead, we allocate
from a special redirecting swap device R as R1, R2, R3, R4, with an IO
redirecting array of [R1, A1, A3, A5, A7]. Swap device R is virtual;
there is no real file backing it, so the swap file size on R can grow
or shrink as needed.

In add_to_swap_cache(), we set folio F->swap = R1 and add F into swap
cache S with entries [R1..R4] pointing to folio F. In other words,
S[R1..R4] = F. We also add an additional lookup xarray
L[R1..R4] = [R1, A1, A3, A5, A7]. For the rest of the code, R1 is
passed around as the contiguous swap entry for folio F.

swap_writepage_bdev_async() will recognize R as a special device. It
will look up the xarray at L[R1] to get [R1, A1, A3, A5, A7] and use
that entry list to build the bio with 4 io vectors instead of 1,
filling [A1, A3, A5, A7] into the bio vecs. That is the swap write
path.

For swap-in, the page fault handler gets a fault at address X and looks
up the PTE, which contains swap entry R3. It looks up the swap cache at
S[R3] and gets nothing: folio F is not in the swap cache. It recognizes
that R is a remapping device, so the swap core looks up
L[R3] = [R1, A1, A3, A5, A7]. If we want to swap in an order-2 folio,
we construct swap_read_folio_bdev_async() with the io vector
[A1, A3, A5, A7]. If we just want to swap in a single 4K page, we can
construct the io vector as [A5] alone, given that the run starts at R1
and the fault is on R3. That is the read path.

For simplicity, a lot of detail is omitted from this description.
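
To make the lookup mapping a bit more concrete, here is a minimal
user-space sketch of the redirect table idea. This is not kernel code:
the names (struct redirect, rdev_alloc(), rdev_resolve_one(), and so
on) are made up for illustration, and in the kernel the mapping would
live behind the swap cache and the bio construction in page_io.c
rather than behind malloc() and printf().

/*
 * Minimal user-space model of the redirecting swap device described
 * above. All names here are hypothetical, not existing kernel APIs:
 * "rdev" stands in for the virtual device R, "real" for swapfile A.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_RDEV_SLOTS 64        /* size of the toy virtual device R */

/* One redirect record: contiguous run R..R+nr-1 -> scattered A slots. */
struct redirect {
        unsigned long r_head;   /* head entry on virtual device R */
        int nr;                 /* run length (folio nr_pages)    */
        unsigned long real[8];  /* A slots, one per R entry       */
};

/* Stand-in for the lookup mapping L[R1..R4] = [R1, A1, A3, A5, A7]. */
static struct redirect *rdev_map[NR_RDEV_SLOTS];
static unsigned long rdev_next = 1;

/* Allocate nr contiguous R entries and bind them to scattered A slots. */
static unsigned long rdev_alloc(const unsigned long *real_slots, int nr)
{
        struct redirect *rd = malloc(sizeof(*rd));
        unsigned long r_head = rdev_next;

        rd->r_head = r_head;
        rd->nr = nr;
        for (int i = 0; i < nr; i++) {
                rd->real[i] = real_slots[i];
                rdev_map[r_head + i] = rd;  /* every L[Rk] hits the record */
        }
        rdev_next += nr;
        return r_head;          /* this is what folio->swap would carry */
}

/* Swap-out path: resolve the R run to the scattered A slots for the bio. */
static void rdev_print_write(unsigned long r_head)
{
        struct redirect *rd = rdev_map[r_head];

        printf("write folio @R%lu -> real slots:", r_head);
        for (int i = 0; i < rd->nr; i++)
                printf(" A%lu", rd->real[i]);   /* one bio segment per slot */
        printf("\n");
}

/* Swap-in path: a 4K fault on entry R resolves to exactly one A slot. */
static unsigned long rdev_resolve_one(unsigned long r)
{
        struct redirect *rd = rdev_map[r];

        return rd->real[r - rd->r_head];        /* offset within the R run */
}

int main(void)
{
        /* Swapfile A is fragmented: only A1, A3, A5, A7 are free. */
        unsigned long free_in_a[] = { 1, 3, 5, 7 };
        unsigned long r_head = rdev_alloc(free_in_a, 4);

        rdev_print_write(r_head);               /* whole order-2 folio out */
        printf("4K fault on R%lu -> A%lu\n", r_head + 2,
               rdev_resolve_one(r_head + 2));   /* R3 resolves to A5 */
        return 0;
}

The point is just the shape of the data: a contiguous run of R entries
all referencing one record that carries the scattered A slots, so both
the whole-folio write-out and a single 4K fault resolve with one
lookup.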
Also on the implementation side, there are a lot of optimizations we
might be able to do, e.g. using a pointer lookup from R1 instead of an
xarray; we could use a struct to hold R1 and [A1, A3, A5, A7], etc.

This approach avoids a lot of the complexity of breaking the contiguity
assumption for swap cache entries, at the cost of the additional swap
cache address space R. The lookup mapping
L[R1..R4] = [R1, A1, A3, A5, A7] is the minimal data structure needed
to track the IO remapping; I think that is unavoidable.

Please let me know if you see any problem with the above approach. As
always, feedback is welcome as well.

Thanks

Chris