From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id CAF36C54E60
	for <linux-mm@archiver.kernel.org>; Tue, 19 Mar 2024 06:27:32 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 59AC56B0083; Tue, 19 Mar 2024 02:27:32 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 523FC6B0085; Tue, 19 Mar 2024 02:27:32 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 3C44C6B0087; Tue, 19 Mar 2024 02:27:32 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16])
	by kanga.kvack.org (Postfix) with ESMTP id 2608F6B0083
	for <linux-mm@kvack.org>; Tue, 19 Mar 2024 02:27:32 -0400 (EDT)
Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id C78B9A0FCC
	for <linux-mm@kvack.org>; Tue, 19 Mar 2024 06:27:31 +0000 (UTC)
X-FDA: 81912807102.27.C0F80DE
Received: from mail-ua1-f49.google.com (mail-ua1-f49.google.com [209.85.222.49])
	by imf17.hostedemail.com (Postfix) with ESMTP id F02CE4000D
	for <linux-mm@kvack.org>; Tue, 19 Mar 2024 06:27:29 +0000 (UTC)
Authentication-Results: imf17.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b="d/RigCTL";
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf17.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.49 as permitted sender) smtp.mailfrom=21cnbao@gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1710829650;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=Vs7vWTjjMcASIAitT8voMJEud9O3qlM3o8yjeSirqcc=;
	b=Tvn692XVuqbtc+JPbEQ2RTYDR33KG5mtuIPtPS/1i86LaWXFBTYGWYI6a9r3nH0QnDWv3o
	3Z4TXXSScqRynt/2ze+Rmgt9XGqIPiEDGawlM5gIIOoTUAErmyPzqYRf0onJakTGZ0z5Xf
	0zQsI109CimQTeCGmRb8ShNZ1TDiPh8=
ARC-Authentication-Results: i=1;
	imf17.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b="d/RigCTL";
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf17.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.49 as permitted sender) smtp.mailfrom=21cnbao@gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1710829650; a=rsa-sha256;
	cv=none;
	b=6ShpYCKQigU24AoBdeKm/ItxIsrv36fxIFTd2CqZBZTR1WIoEFFpSCgZwJ+bemj5Xg0TD1
	pW2dV4kYeETbNRDna+RN794+poA1R6mgDeheXXbCWhjK7O+OFRQanpDKwIBYWu/ns9WzD3
	D9zs8ivkHWv68j6IL90rCm3vmbi8xSk=
Received: by mail-ua1-f49.google.com with SMTP id a1e0cc1a2514c-7e03e591693so1204580241.1
        for <linux-mm@kvack.org>; Mon, 18 Mar 2024 23:27:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1710829649; x=1711434449; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=Vs7vWTjjMcASIAitT8voMJEud9O3qlM3o8yjeSirqcc=;
        b=d/RigCTL0imY8pP+2KS3nOZh5u5lCIKj+hQ10owUdmrHB7X5OAbbynoWUo7NVIRCja
         xFKFI6naRHAUo8bYU3I8oLnzWxbY09/zpGzdNGd5YKQ+qM1EwjUge1gKTKp+gJ/RDDp4
         6aRPRrFGyVy/WrXrDJgd0+Wy4o+5w59BZZfAtFUlojFXkooNP9mA2MDHL4wfWrwJrgxi
         WFNOhy6GYg19qyihQwKL6W2YFBZia9KKLlM3Wpwn5hX/kuX0pPa3tea8QBe+W9Ne5Jxv
         +oJgPJfMYRT4Pkcrr2j2Arp8k7vC56u2gaS3q+OOhycm7PZNLZ3lA1HES1PGySlyUL35
         J3KQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1710829649; x=1711434449;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=Vs7vWTjjMcASIAitT8voMJEud9O3qlM3o8yjeSirqcc=;
        b=Y55iBy0j9lGf9IXlaZhHzS2tPzYmbR15txi61fmott99UeD+NG6uDpUZNihhe27auy
         AKoYgQ5zJomEJP0RtXfBBGf8861bX2mKN3umkK9825BwDDwZtIZb/HBtqGM9cILBbJqs
         1VLaO7D2iG9YWs5jmjWBtBRDaADqccM6lU1TR49xDj38xRDRXtHwSEWokXNcKmlKu+6g
         JfeF0kGIkociQ2pEl076GX3sUoSZdjL7SJoWBpIUvg5pysQG2FoVc8oaVCcYrNTRBpLI
         ERRMjte+hLbV8PBE1wVXYyn3xDMp2/yhLW1DrENEbvpvbZQn+NF76PfiJdrwlK781rm6
         W83g==
X-Forwarded-Encrypted: i=1; AJvYcCWT4LjTGwqLjObzhYF1WocTgdBe9MA9W+a4hXTRZKrOw4detZ2bbfYwhjFx5ZaPCck+OmQb2mnvKfWf/xJSkuuJyKU=
X-Gm-Message-State: AOJu0YyeLnaS5+nPGrSqc7M2hPbJqKpbUXVPHQC1kzN38T5abJsioMYD
	FEon/zW6EgjGcApWGbmgWN/rjepzs2OjXpYbIPgpa32TFQI18OMYyD5RdMMB70CyKpMXFI+eZRn
	nCP1RCs4ZRLIBnYP7S/FGXnJ/uCU=
X-Google-Smtp-Source: AGHT+IGlSLWj7FvqNg95icP/DE4JbNFLu1eeAnxnc7mSkqtYFFe8hed4h08NbE6pekrB61R3PPAICKqmiCdDLarE1cs=
X-Received: by 2002:a05:6122:731:b0:4d4:be1:8196 with SMTP id
 49-20020a056122073100b004d40be18196mr10456951vki.11.1710829648831; Mon, 18
 Mar 2024 23:27:28 -0700 (PDT)
MIME-Version: 1.0
References: <20240304081348.197341-1-21cnbao@gmail.com> <20240304081348.197341-6-21cnbao@gmail.com>
 <87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com> <CAGsJ_4x+t_X4Tn15=QPbH58e1S1FwOoM3t37T+cUE8-iKoENLw@mail.gmail.com>
 <87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com> <CAGsJ_4xna1xKz7J=MWDR3h543UvnS9v0-+ggVc5fFzpFOzfpyA@mail.gmail.com>
 <87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com> <CAGsJ_4wTU3cmzXMCu+yQRMnEiCEUA8rO5=QQUopgG0RMnHYd5g@mail.gmail.com>
 <9ec62266-26f1-46b6-8bb7-9917d04ed04e@arm.com>
In-Reply-To: <9ec62266-26f1-46b6-8bb7-9917d04ed04e@arm.com>
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 19 Mar 2024 19:27:17 +1300
Message-ID: <CAGsJ_4xBiWWEbyaxC6nhjpA5te6Q8irQmFxZDePCRZtcpF0sVQ@mail.gmail.com>
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: "Huang, Ying" <ying.huang@intel.com>, Matthew Wilcox <willy@infradead.org>, akpm@linux-foundation.org, 
	linux-mm@kvack.org, chengming.zhou@linux.dev, chrisl@kernel.org, 
	david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com, 
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, 
	mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com, steven.price@arm.com, 
	surenb@google.com, wangkefeng.wang@huawei.com, xiang@kernel.org, 
	yosryahmed@google.com, yuzhao@google.com, Chuanhua Han <hanchuanhua@oppo.com>, 
	Barry Song <v-songbaohua@oppo.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspamd-Queue-Id: F02CE4000D
X-Rspam-User: 
X-Rspamd-Server: rspam02
X-Stat-Signature: ckdqtibt33h6xhgof1bnsg1jd31yocg9
X-HE-Tag: 1710829649-599612
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> >>> I agree phones are not the only platform. But Rome wasn't built in a
> >>> day. I can only get started on a hardware which I can easily reach and
> >>> have enough hardware/test resources on it. So we may take the first step
> >>> which can be applied on a real product and improve its performance, and
> >>> step by step, we broaden it and make it widely useful to various areas
> >>> in which I can't reach :-)
> >>
> >> We must guarantee the normal swap path runs correctly and has no
> >> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> So we have to put some effort on the normal path test anyway.
> >>
> >>> so probably we can have a sysfs "enable" entry with default "n" or
> >>> have a maximum swap-in order as Ryan's suggestion [1] at the beginning,
> >>>
> >>> "
> >>> So in the common case, swap-in will pull in the same size of folio as was
> >>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>> sense to copy the whole folio up to a certain size.
> >>> "
>
> I thought about this a bit more. No clear conclusions, but hoped this might help
> the discussion around policy:
>
> The decision about the size of the THP is made at first fault, with some help
> from user space and in future we might make decisions to split based on
> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> the THP out at some point in its lifetime should not impact on its size. It's
> just being moved around in the system and the reason for our original decision
> should still hold.

Indeed, this is the ideal model for smartphones, and likely for the many
embedded Linux systems that use zRAM. We set the mTHP size to 64KiB to
leverage CONT-PTE, given that more than half of the memory on phones may
frequently be swapped out and swapped in (for instance, when opening and
switching between apps). The ideal approach would be to stick to the size
decision made in do_anonymous_page().

>
> So from that PoV, it would be good to swap-in to the same size that was
> swapped-out. But we only kind-of keep that information around, via the swap
> entry contiguity and alignment. With that scheme it is possible that multiple
> virtually adjacent but not physically contiguous folios get swapped-out to
> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> folio. This is not ideal, and I think it would be valuable to try to maintain
> the original folio size information with the swap slot. One way to do this would
> be to store the original order for which the cluster was allocated in the
> cluster. Then we at least know that a given swap slot is either for a folio of
> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> steal a bit from swap_map to determine which case it is? Or are there better
> approaches?

In the non-SWP_SYNCHRONOUS_IO case, the fault path always goes through
swapin_readahead(), even when __swap_count(entry) equals 1. This leads to
two scenarios: swap_vma_readahead and swap_cluster_readahead.

In swap_vma_readahead, which is used when the swap device is non-rotational
(blk_queue_nonrot), physical contiguity doesn't appear to be a critical
concern. For swap_cluster_readahead, however, the potential impact of
physical discontiguity becomes the main concern.

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                struct vm_fault *vmf)
{
        struct mempolicy *mpol;
        pgoff_t ilx;
        struct folio *folio;

        mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
        folio = swap_use_vma_readahead() ?
                swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
                swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
        mpol_cond_put(mpol);

        if (!folio)
                return NULL;
        return folio_file_page(folio, swp_offset(entry));
}

In Android and embedded systems, SWP_SYNCHRONOUS_IO is used consistently,
which makes physical contiguity less of a concern. Moreover, swapin_readahead()
is rarely reached there; it typically happens only for memory that has been
forked but not yet CoWed.
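
To make the path selection described above concrete, here is a rough
user-space model of it. This is only a sketch, not kernel code: the enum,
the helper choose_swapin_path() and its parameters are made up for
illustration, and the three booleans/counters simply stand in for the
SWP_SYNCHRONOUS_IO flag, __swap_count(entry) and swap_use_vma_readahead().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the ways a swap fault can be served. */
enum swapin_path {
	SWAPIN_DIRECT,            /* synchronous device, entry not shared */
	SWAPIN_VMA_READAHEAD,     /* stands for swap_vma_readahead */
	SWAPIN_CLUSTER_READAHEAD, /* stands for swap_cluster_readahead */
};

/*
 * Illustrative model only: readahead is skipped when the device is
 * synchronous and the entry has a single user; every other fault goes
 * through one of the two readahead flavours.
 */
static enum swapin_path choose_swapin_path(bool synchronous_io,
					   int swap_count,
					   bool use_vma_readahead)
{
	assert(swap_count >= 1);

	if (synchronous_io && swap_count == 1)
		return SWAPIN_DIRECT;

	return use_vma_readahead ? SWAPIN_VMA_READAHEAD
				 : SWAPIN_CLUSTER_READAHEAD;
}
```

With this model, a zRAM-backed entry with a single user takes the direct
path, while the same entry shared after fork() (swap count 2) falls back to
readahead, which is the rare case mentioned above.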

So I think large folio swap-in will need at least three steps:

1. On SWP_SYNCHRONOUS_IO devices (Android and embedded Linux), the model is
very clear and there are no complex I/O issues.
2. On non-rotational block devices (bdev_nonrot == true), I/O contiguity
matters less.
3. On rotational block devices, I/O contiguity matters.

This patchset primarily addresses type 1: systems using SWP_SYNCHRONOUS_IO,
such as Android and embedded Linux, where the model is straightforward and
I/O adds minimal complexity.
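
As a rough illustration of the three device types listed above, the
classification could be modelled like this. It is a user-space sketch under
my own naming: struct swap_dev_model, enum swapin_class and
classify_swap_dev() are hypothetical, and the two booleans only stand in for
the SWP_SYNCHRONOUS_IO flag and for bdev_nonrot() returning true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three steps/types discussed in this mail. */
enum swapin_class {
	SWAPIN_SYNC_IO = 1, /* type 1: SWP_SYNCHRONOUS_IO, e.g. zRAM */
	SWAPIN_NONROT  = 2, /* type 2: non-rotational block device */
	SWAPIN_ROT     = 3, /* type 3: rotational block device */
};

/* Minimal stand-in for the properties of a swap device. */
struct swap_dev_model {
	bool synchronous_io; /* models the SWP_SYNCHRONOUS_IO flag */
	bool nonrot;         /* models bdev_nonrot() == true */
};

/* Synchronous I/O takes priority; rotational is the fallback. */
static enum swapin_class classify_swap_dev(const struct swap_dev_model *dev)
{
	assert(dev != NULL);

	if (dev->synchronous_io)
		return SWAPIN_SYNC_IO;
	if (dev->nonrot)
		return SWAPIN_NONROT;
	return SWAPIN_ROT;
}
```

The ordering of the checks is a design choice in this sketch: a zRAM device
is both synchronous and non-rotational, and it should land in type 1.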

>
> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
> say yes but I have no data. But on the other hand, memory is aged and
> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
> original allocation size policy is good (it currently isn't) then a folio should
> be sized to cover temporally close memory and if we need to access some of it,
> chances are we need all of it.
>
> If we think the IO concern is legitimate then we could define a threshold size
> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
> those chunks be - one page, or the threshold size itself? Probably the latter?
> And perhaps that threshold could also be used by zRAM to decide its upper limit
> for compression chunk.


Agreed. What about introducing a parameter like
/sys/kernel/mm/transparent_hugepage/max_swapin_order,
giving users the opportunity to tune it according to their needs? For type 1
users specifically, setting it to any value above 4 would be beneficial. As
long as desktop and server environments (type 2 and type 3) have not been
tuned, the default value could be set to 0.
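
A small sketch of how such a knob might be applied, purely as an
illustration. max_swapin_order is only a proposal in this thread, and the
helpers swapin_order() and swapin_bytes() are hypothetical names. The sketch
assumes the "chunk at the threshold size" policy Ryan leaned towards, and
4KiB base pages in the usage note.

```c
#include <assert.h>

/*
 * Hypothetical policy: swap in at the order the folio was swapped out
 * with, but never above the proposed max_swapin_order tunable.
 */
static unsigned int swapin_order(unsigned int swapped_out_order,
				 unsigned int max_swapin_order)
{
	return swapped_out_order < max_swapin_order ? swapped_out_order
						    : max_swapin_order;
}

/* Bytes read by one swap-in of the given order. */
static unsigned long swapin_bytes(unsigned int order, unsigned long page_size)
{
	assert(order < 32); /* keep the shift well defined */
	return page_size << order;
}
```

With 4KiB base pages, a 64KiB mTHP is order 4, so a limit of 4 lets it come
back in one piece (16 pages), a 2MiB THP (order 9) would be cut down to
64KiB chunks under that limit, and a limit of 0 keeps today's
page-at-a-time behaviour.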

>
> Perhaps we can learn from khugepaged here? I think it has programmable
> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
> THP? I guess that exists for the same concerns about increased IO pressure?
>
>
> If we think we will ever be swapping-in folios in chunks less than their
> original size, then we need a separate mechanism to re-foliate them. We have
> discussed a khugepaged-like approach for doing this asynchronously in the
> background. I know that scares the Android folks, but David has suggested that
> this could well be very cheap compared with khugepaged, because it would be
> entirely limited to a single pgtable, so we only need the PTL. If we need this
> mechanism anyway, perhaps we should develop it and see how it performs if
> swap-in remains order-0? Although I guess that would imply not being able to
> benefit from compressing THPs for the zRAM case.

The effectiveness of a collapse operation depends on large folios staying
stable once they have been formed. In embedded systems, where more than half
of the memory may be allocated to zRAM, folios might be swapped out before
they are collapsed, or immediately after the collapse. A TAO-like
optimization that reduces fallback and latency seems more effective.

>
> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
> think the latter just implies that you might want to do some readahead to try to
> cover up the latency? If swap is moving towards being folio-orientated, then
> readahead also surely needs to be folio-orientated, but I think that should be
> the only major difference.
>
> Anyway, just some thoughts!

Thank you very much for your valuable and insightful deliberations.

>
> Thanks,
> Ryan
>

Thanks
Barry