From: Barry Song <21cnbao@gmail.com>
Date: Thu, 21 Mar 2024 07:38:37 +1300
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: "Huang, Ying"
Cc: Ryan Roberts, Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
 chengming.zhou@linux.dev, chrisl@kernel.org, david@redhat.com,
 hannes@cmpxchg.org, kasong@tencent.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, mhocko@suse.com, nphamcs@gmail.com,
 shy828301@gmail.com, steven.price@arm.com, surenb@google.com,
 wangkefeng.wang@huawei.com, xiang@kernel.org, yosryahmed@google.com,
 yuzhao@google.com, Chuanhua Han, Barry Song
In-Reply-To: <87msqts9u1.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240304081348.197341-1-21cnbao@gmail.com> <20240304081348.197341-6-21cnbao@gmail.com>

On Wed, Mar 20, 2024 at
7:22 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying wrote:
> >>
> >> Ryan Roberts writes:
> >>
> >> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> >> Ryan Roberts writes:
> >> >>
> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >> >>>>>> day. I can only get
> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >> >>>>>> resources on it. So we may take the first step which can be applied on
> >> >>>>>> a real product
> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >> >>>>>> widely useful to various areas in which I can't reach :-)
> >> >>>>>
> >> >>>>> We must guarantee the normal swap path runs correctly and has no
> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> >>>>> So we have to put some effort on the normal path test anyway.
> >> >>>>>
> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >> >>>>>> have a maximum
> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >> >>>>>>
> >> >>>>>> "
> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> >>>>>> sense to copy the whole folio up to a certain size.
> >> >>>>>> "
> >> >>>
> >> >>> I thought about this a bit more.
> >> >>> No clear conclusions, but hoped this might help
> >> >>> the discussion around policy:
> >> >>>
> >> >>> The decision about the size of the THP is made at first fault, with some help
> >> >>> from user space and in future we might make decisions to split based on
> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >> >>> just being moved around in the system and the reason for our original decision
> >> >>> should still hold.
> >> >>>
> >> >>> So from that PoV, it would be good to swap-in to the same size that was
> >> >>> swapped-out.
> >> >>
> >> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
> >> >> smallest size if the page is only accessed seldom to avoid to waste
> >> >> memory.
> >> >
> >> > If we want to optimize only for memory consumption, I'm sure there are many
> >> > things we would do differently. We need to find a balance between memory and
> >> > performance. The benefits of folios are well documented and the kernel is
> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
> >> > think it's as simple as saying we should always swap-in the smallest possible
> >> > amount of memory.
> >>
> >> It's conditional, that is,
> >>
> >> "if the page is only accessed seldom"
> >>
> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> >> the same large folio will not be accessed during this period.
> >>
> >> So, I suggest to create an algorithm to decide swap-in order based on
> >> swap-readahead information automatically. It can detect the situation
> >> above via reduced swap readahead window size. And, if the page is
> >> accessed for quite long time, and the adjacent pages in the same large
> >> folio are accessed too, swap-readahead window will increase and large
> >> swap-in order will be used.
> >
> > The original size of do_anonymous_page() should be honored, considering it
> > embodies a decision influenced by not only sysfs settings and per-vma
> > HUGEPAGE hints but also architectural characteristics, for example
> > CONT-PTE.
> >
> > The model you're proposing may offer memory-saving benefits or reduce I/O,
> > but it entirely disassociates the size of the swap-in from the size prior to the
> > swap-out.
>
> Readahead isn't the only factor to determine folio order. For example,
> we must respect "never" policy to allocate order-0 folio always.
> There's no requirements to use swap-out order in swap-in too. Memory
> allocation has different performance character of storage reading.

Still quite unclear. If users have only enabled 64KiB (order-4) large
folios in sysfs, and the readahead algorithm requires 16KiB, what should
be set as the large folio size? Setting it to 16KiB doesn't align with
users' requirements, while setting it to 64KiB would be wasteful
according to your criteria.

> > Moreover, there's no guarantee that the large folio generated by
> > the readahead window is contiguous in the swap and can be added to the
> > swap cache, as we are currently dealing with folio->swap instead of
> > subpage->swap.
>
> Yes. We can optimize only when all conditions are satisfied. Just like
> other optimization.
>
> > Incidentally, do_anonymous_page() serves as the initial location for allocating
> > large folios. Given that memory conservation is a significant consideration in
> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
>
> Yes. We should consider that too. IIUC, that is why mTHP support is
> off by default for now. After we find a way to solve the memory usage
> issue. We may make default "on".

It's challenging to establish a universal solution because various
systems exhibit diverse hardware characteristics, and VMAs may require
different alignments.
The current sysfs and per-vma hints allow users the opportunity to
customize settings according to their specific requirements.

> > A large folio, by its nature, represents a high-quality resource that has the
> > potential to leverage hardware characteristics for the benefit of the
> > entire system.
>
> But not at the cost of memory wastage.
>
> > Conversely, I don't believe that a randomly determined size dictated by the
> > readahead window possesses the same advantageous qualities.
>
> There's a readahead algorithm which is not pure random.
>
> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> > their needs should also be respected.
>
> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
> devices. I just suggest to work on general code before specific
> optimization.

I disagree with your definition of "special" and "general". According to
your logic, non-SWP_SYNCHRONOUS_IO devices could also be classified as
"special". Furthermore, the number of systems running SWP_SYNCHRONOUS_IO
is significantly greater than those running non-SWP_SYNCHRONOUS_IO,
contradicting your assertion.

SWP_SYNCHRONOUS_IO devices have a minor chance of being involved in
readahead. However, in OPPO's code, which hasn't been sent to LKML yet,
we use the exact same size as do_anonymous_page() for readahead.

Without a clear description of how you want the new readahead algorithm
to balance memory waste against users' hints from sysfs and per-vma
flags, it appears to be an ambiguous area to address. Please provide a
clear description of how you would like the new readahead algorithm to
function. I believe this clarity will facilitate others in attempting
to implement it.

>
> >> > You also said we should swap *out* in smallest size possible. Have I
> >> > misunderstood you? I thought the case for swapping-out a whole folio without
> >> > splitting was well established and non-controversial?
> >>
> >> That is conditional too.
> >>
> >> >>
> >> >>> But we only kind-of keep that information around, via the swap
> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> >>> the original folio size information with the swap slot. One way to do this would
> >> >>> be to store the original order for which the cluster was allocated in the
> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >> >>> approaches?
> >> >>
> >> >> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry