From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Ryan Roberts, Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
	chengming.zhou@linux.dev, chrisl@kernel.org, david@redhat.com,
	hannes@cmpxchg.org, kasong@tencent.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, mhocko@suse.com, nphamcs@gmail.com,
	shy828301@gmail.com, steven.price@arm.com, surenb@google.com,
	wangkefeng.wang@huawei.com, xiang@kernel.org, yosryahmed@google.com,
	yuzhao@google.com, Chuanhua Han, Barry Song
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
In-Reply-To: (Barry Song's message of "Wed, 20 Mar 2024 15:47:50 +1300")
References: <20240304081348.197341-1-21cnbao@gmail.com>
	<20240304081348.197341-6-21cnbao@gmail.com>
	<87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<9ec62266-26f1-46b6-8bb7-9917d04ed04e@arm.com>
	<87jzlyvar3.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87zfutsl25.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 20 Mar 2024 14:20:38 +0800
Message-ID: <87msqts9u1.fsf@yhuang6-desk2.ccr.corp.intel.com>

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying wrote:
>>
>> Ryan Roberts writes:
>>
>> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> Ryan Roberts writes:
>> >>
>> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >>>>>> day. I can only get
>> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >>>>>> resources on it. So we may take the first step which can be applied on
>> >>>>>> a real product
>> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >>>>>> widely useful to various areas in which I can't reach :-)
>> >>>>>
>> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >>>>> So we have to put some effort on the normal path test anyway.
>> >>>>>
>> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >>>>>> have a maximum
>> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >>>>>>
>> >>>>>> "
>> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >>>>>> sense to copy the whole folio up to a certain size.
>> >>>>>> "
>> >>>
>> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >>> the discussion around policy:
>> >>>
>> >>> The decision about the size of the THP is made at first fault, with some help
>> >>> from user space and in future we might make decisions to split based on
>> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >>> just being moved around in the system and the reason for our original decision
>> >>> should still hold.
>> >>>
>> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >>> swapped-out.
>> >>
>> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> >> smallest size if the page is only accessed seldom to avoid to waste
>> >> memory.
>> >
>> > If we want to optimize only for memory consumption, I'm sure there are many
>> > things we would do differently. We need to find a balance between memory and
>> > performance. The benefits of folios are well documented and the kernel is
>> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> > think it's as simple as saying we should always swap-in the smallest possible
>> > amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> the same large folio will not be accessed during this period.
>>
>> So, I suggest to create an algorithm to decide swap-in order based on
>> swap-readahead information automatically.  It can detect the situation
>> above via reduced swap readahead window size.  And, if the page is
>> accessed for quite long time, and the adjacent pages in the same large
>> folio are accessed too, swap-readahead window will increase and large
>> swap-in order will be used.
>
> The original size of do_anonymous_page() should be honored, considering it
> embodies a decision influenced by not only sysfs settings and per-vma
> HUGEPAGE hints but also architectural characteristics, for example
> CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O,
> but it entirely disassociates the size of the swap in from the size prior to the
> swap out.

Readahead isn't the only factor that determines folio order.  For
example, we must respect the "never" policy and always allocate order-0
folios in that case.  And there's no requirement to reuse the swap-out
order at swap-in time; memory allocation has a different performance
character from storage reading.
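To make the suggestion concrete, here is a minimal user-space sketch of
such a policy (the function and parameter names are invented for
illustration; this is not the kernel implementation):

```c
#include <assert.h>

/*
 * Toy model: derive a swap-in folio order from the current
 * swap-readahead window.  A collapsed window (seldom-accessed pages)
 * yields order 0; a window that has grown through hits on adjacent
 * pages yields a larger order, capped by a sysfs/per-VMA maximum.
 * A "never" mTHP policy forces order 0 regardless of the window.
 */
static int suggested_swapin_order(unsigned int window_pages,
				  int max_order, int policy_never)
{
	int order = 0;

	if (policy_never || window_pages <= 1)
		return 0;

	/* largest order such that (1 << order) <= window_pages */
	while ((1u << (order + 1)) <= window_pages)
		order++;

	return order < max_order ? order : max_order;
}
```

With a policy like this, a shrinking readahead window naturally degrades
swap-in to order 0, while sustained accesses to adjacent pages grow the
window and hence the swap-in order.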
> Moreover, there's no guarantee that the large folio generated by
> the readahead window is contiguous in the swap and can be added to the
> swap cache, as we are currently dealing with folio->swap instead of
> subpage->swap.

Yes.  We can optimize only when all conditions are satisfied, just like
other optimizations.

> Incidentally, do_anonymous_page() serves as the initial location for allocating
> large folios. Given that memory conservation is a significant consideration in
> do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes.  We should consider that too.  IIUC, that is why mTHP support is
off by default for now.  After we find a way to solve the memory usage
issue, we may make the default "on".

> A large folio, by its nature, represents a high-quality resource that has the
> potential to leverage hardware characteristics for the benefit of the
> entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the
> readahead window possesses the same advantageous qualities.

There's a readahead algorithm behind the window size; it is not purely
random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO
devices.  I just suggest working on the general code before the specific
optimization.

>> > You also said we should swap *out* in smallest size possible. Have I
>> > misunderstood you? I thought the case for swapping-out a whole folio without
>> > splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>> >>> But we only kind-of keep that information around, via the swap
>> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >>> the original folio size information with the swap slot. One way to do this would
>> >>> be to store the original order for which the cluster was allocated in the
>> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >>> approaches?
>> >>
>> >> [snip]

--
Best Regards,
Huang, Ying
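P.S. The cluster bookkeeping idea quoted above can be modeled in a few
lines of user-space C (the struct, field, and flag names below are all
invented; the real swap_map encoding is different):

```c
#include <assert.h>

/*
 * Toy model: each swap cluster remembers the order it was allocated
 * for, and one stolen bit per slot marks slots that were handed out
 * as order-0 fallbacks (cluster exhaustion/scanning) rather than as
 * part of a folio of the cluster's order.
 */
#define SLOT_ORDER0_FALLBACK 0x80	/* invented flag bit */

struct toy_cluster {
	unsigned char order;	/* order the cluster was allocated for */
};

/* Recover the original folio order for a slot in this cluster. */
static int slot_folio_order(const struct toy_cluster *c,
			    unsigned char swap_map_entry)
{
	if (swap_map_entry & SLOT_ORDER0_FALLBACK)
		return 0;	/* order-0 slot in a higher-order cluster */
	return c->order;
}
```

This only answers "what order was this slot swapped out at"; it says
nothing about whether the neighboring slots still hold the rest of the
original folio, which would still need to be checked at swap-in time.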