From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 19 Mar 2024 12:19:27 +0000
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: "Huang, Ying"
Cc: Barry Song <21cnbao@gmail.com>, Matthew Wilcox, akpm@linux-foundation.org,
 linux-mm@kvack.org, chengming.zhou@linux.dev, chrisl@kernel.org,
 david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com,
 steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com,
 xiang@kernel.org, yosryahmed@google.com, yuzhao@google.com,
 Chuanhua Han, Barry Song
References: <20240304081348.197341-1-21cnbao@gmail.com>
 <20240304081348.197341-6-21cnbao@gmail.com>
 <87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <9ec62266-26f1-46b6-8bb7-9917d04ed04e@arm.com>
 <87jzlyvar3.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Ryan Roberts
In-Reply-To: <87jzlyvar3.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset=UTF-8

On 19/03/2024 09:20, Huang, Ying wrote:
> Ryan Roberts writes:
>
>>>>> I agree phones are not the only platform. But Rome wasn't built in
>>>>> a day. I can only get started on a hardware which I can easily
>>>>> reach and have enough hardware/test resources on it.
>>>>> So we may take the first step which can be applied on a real
>>>>> product and improve its performance, and step by step, we broaden
>>>>> it and make it widely useful to various areas in which I can't
>>>>> reach :-)
>>>>
>>>> We must guarantee the normal swap path runs correctly and has no
>>>> performance regression when developing SWP_SYNCHRONOUS_IO
>>>> optimization. So we have to put some effort on the normal path
>>>> test anyway.
>>>>
>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>> have a maximum swap-in order as Ryan's suggestion [1] at the
>>>>> beginning,
>>>>>
>>>>> "
>>>>> So in the common case, swap-in will pull in the same size of folio
>>>>> as was swapped-out. Is that definitely the right policy for all
>>>>> folio sizes? Certainly it makes sense for "small" large folios
>>>>> (e.g. up to 64K IMHO). But I'm not sure it makes sense for 2M THP;
>>>>> As the size increases the chances of actually needing all of the
>>>>> folio reduces so chances are we are wasting IO. There are similar
>>>>> arguments for CoW, where we currently copy 1 page per fault - it
>>>>> probably makes sense to copy the whole folio up to a certain size.
>>>>> "
>>
>> I thought about this a bit more. No clear conclusions, but hoped this
>> might help the discussion around policy:
>>
>> The decision about the size of the THP is made at first fault, with
>> some help from user space, and in future we might make decisions to
>> split based on munmap/mremap/etc hints. In an ideal world, the fact
>> that we have had to swap the THP out at some point in its lifetime
>> should not impact its size. It's just being moved around in the
>> system and the reason for our original decision should still hold.
>>
>> So from that PoV, it would be good to swap-in to the same size that
>> was swapped-out.
>
> Sorry, I don't agree with this. It's better to swap-in and swap-out in
> smallest size if the page is only accessed seldom to avoid to waste
> memory.
If we want to optimize only for memory consumption, I'm sure there are
many things we would do differently. We need to find a balance between
memory and performance. The benefits of folios are well documented, and
the kernel is heading in the direction of managing memory in
variable-sized blocks, so I don't think it's as simple as saying we
should always swap-in the smallest possible amount of memory.

You also said we should swap *out* in the smallest size possible. Have
I misunderstood you? I thought the case for swapping-out a whole folio
without splitting was well established and non-controversial?

>
>> But we only kind-of keep that information around, via the swap entry
>> contiguity and alignment. With that scheme it is possible that
>> multiple virtually adjacent but not physically contiguous folios get
>> swapped-out to adjacent swap slot ranges, and then they would be
>> swapped-in to a single, larger folio. This is not ideal, and I think
>> it would be valuable to try to maintain the original folio size
>> information with the swap slot. One way to do this would be to store
>> in the cluster the original order for which the cluster was
>> allocated. Then we at least know that a given swap slot is either for
>> a folio of that order or for an order-0 folio (due to cluster
>> exhaustion/scanning). Can we steal a bit from swap_map to determine
>> which case it is? Or are there better approaches?
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying