From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Ryan Roberts, Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
	chengming.zhou@linux.dev, chrisl@kernel.org, david@redhat.com,
	hannes@cmpxchg.org, kasong@tencent.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, mhocko@suse.com, nphamcs@gmail.com,
	shy828301@gmail.com, steven.price@arm.com, surenb@google.com,
	wangkefeng.wang@huawei.com, xiang@kernel.org, yosryahmed@google.com,
	yuzhao@google.com, Chuanhua Han, Barry Song
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
In-Reply-To: (Barry Song's message of "Wed, 20 Mar 2024 15:47:50 +1300")
References: <20240304081348.197341-1-21cnbao@gmail.com>
	<20240304081348.197341-6-21cnbao@gmail.com>
	<87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<9ec62266-26f1-46b6-8bb7-9917d04ed04e@arm.com>
	<87jzlyvar3.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87zfutsl25.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 20 Mar 2024 14:20:38 +0800
Message-ID: <87msqts9u1.fsf@yhuang6-desk2.ccr.corp.intel.com>

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying wrote:
>>
>> Ryan Roberts writes:
>>
>> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> Ryan Roberts writes:
>> >>
>> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >>>>>> day. I can only get
>> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >>>>>> resources on it. So we may take the first step which can be applied on
>> >>>>>> a real product
>> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >>>>>> widely useful to various areas in which I can't reach :-)
>> >>>>>
>> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >>>>> So we have to put some effort on the normal path test anyway.
>> >>>>>
>> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >>>>>> have a maximum
>> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >>>>>>
>> >>>>>> "
>> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >>>>>> sense to copy the whole folio up to a certain size.
>> >>>>>> "
>> >>>
>> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >>> the discussion around policy:
>> >>>
>> >>> The decision about the size of the THP is made at first fault, with some help
>> >>> from user space and in future we might make decisions to split based on
>> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >>> just being moved around in the system and the reason for our original decision
>> >>> should still hold.
>> >>>
>> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >>> swapped-out.
>> >>
>> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> >> smallest size if the page is only accessed seldom to avoid to waste
>> >> memory.
>> >
>> > If we want to optimize only for memory consumption, I'm sure there are many
>> > things we would do differently. We need to find a balance between memory and
>> > performance. The benefits of folios are well documented and the kernel is
>> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> > think it's as simple as saying we should always swap-in the smallest possible
>> > amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> the same large folio will not be accessed during this period.
>>
>> So, I suggest to create an algorithm to decide swap-in order based on
>> swap-readahead information automatically.  It can detect the situation
>> above via reduced swap readahead window size.  And, if the page is
>> accessed for quite long time, and the adjacent pages in the same large
>> folio are accessed too, swap-readahead window will increase and large
>> swap-in order will be used.
>
> The original size of do_anonymous_page() should be honored, considering it
> embodies a decision influenced by not only sysfs settings and per-vma
> HUGEPAGE hints but also architectural characteristics, for example
> CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O,
> but it entirely disassociates the size of the swap in from the size prior to the
> swap out.

Readahead isn't the only factor that determines folio order.  For
example, we must respect the "never" policy and always allocate order-0
folios in that case.  And there's no requirement to reuse the swap-out
order at swap-in time; memory allocation has a different performance
character from storage reading.
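To make the suggestion concrete, here is a minimal user-space sketch of
such a policy (the function and parameter names are invented for
illustration; this is not the kernel implementation):

```c
#include <assert.h>

/*
 * Toy model: derive a swap-in folio order from the current
 * swap-readahead window.  A collapsed window (seldom-accessed pages)
 * yields order 0; a window that has grown through hits on adjacent
 * pages yields a larger order, capped by a sysfs/per-VMA maximum.
 * A "never" mTHP policy forces order 0 regardless of the window.
 */
static int suggested_swapin_order(unsigned int window_pages,
				  int max_order, int policy_never)
{
	int order = 0;

	if (policy_never || window_pages <= 1)
		return 0;

	/* largest order such that (1 << order) <= window_pages */
	while ((1u << (order + 1)) <= window_pages)
		order++;

	return order < max_order ? order : max_order;
}
```

With a policy like this, a shrinking readahead window naturally degrades
swap-in to order 0, while sustained accesses to adjacent pages grow the
window and hence the swap-in order.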
> Moreover, there's no guarantee that the large folio generated by
> the readahead window is contiguous in the swap and can be added to the
> swap cache, as we are currently dealing with folio->swap instead of
> subpage->swap.

Yes.  We can optimize only when all conditions are satisfied, just like
other optimizations.

> Incidentally, do_anonymous_page() serves as the initial location for allocating
> large folios. Given that memory conservation is a significant consideration in
> do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes.  We should consider that too.  IIUC, that is why mTHP support is
off by default for now.  After we find a way to solve the memory usage
issue, we may make the default "on".

> A large folio, by its nature, represents a high-quality resource that has the
> potential to leverage hardware characteristics for the benefit of the
> entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the
> readahead window possesses the same advantageous qualities.

There's a readahead algorithm behind the window size; it is not purely
random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO
devices.  I just suggest working on the general code before the specific
optimization.

>> > You also said we should swap *out* in smallest size possible. Have I
>> > misunderstood you? I thought the case for swapping-out a whole folio without
>> > splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>> >>> But we only kind-of keep that information around, via the swap
>> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >>> the original folio size information with the swap slot. One way to do this would
>> >>> be to store the original order for which the cluster was allocated in the
>> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >>> approaches?
>> >>
>> >> [snip]

--
Best Regards,
Huang, Ying
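P.S. The cluster bookkeeping idea quoted above can be modeled in a few
lines of user-space C (the struct, field, and flag names below are all
invented; the real swap_map encoding is different):

```c
#include <assert.h>

/*
 * Toy model: each swap cluster remembers the order it was allocated
 * for, and one stolen bit per slot marks slots that were handed out
 * as order-0 fallbacks (cluster exhaustion/scanning) rather than as
 * part of a folio of the cluster's order.
 */
#define SLOT_ORDER0_FALLBACK 0x80	/* invented flag bit */

struct toy_cluster {
	unsigned char order;	/* order the cluster was allocated for */
};

/* Recover the original folio order for a slot in this cluster. */
static int slot_folio_order(const struct toy_cluster *c,
			    unsigned char swap_map_entry)
{
	if (swap_map_entry & SLOT_ORDER0_FALLBACK)
		return 0;	/* order-0 slot in a higher-order cluster */
	return c->order;
}
```

This only answers "what order was this slot swapped out at"; it says
nothing about whether the neighboring slots still hold the rest of the
original folio, which would still need to be checked at swap-in time.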