From: "Huang, Ying" <ying.huang@intel.com>
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
 Yu Zhao, Yang Shi, Michal Hocko, linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
In-Reply-To: (Ryan Roberts's message of "Mon, 16 Oct 2023 13:10:21 +0100")
References: <20231010142111.3997780-1-ryan.roberts@arm.com>
 <20231010142111.3997780-3-ryan.roberts@arm.com>
 <87r0m1ftvu.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <1ccde143-18a7-483b-a9a4-fff07b0edc72@arm.com>
 <87ttqrm6pk.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 17 Oct 2023 13:44:18 +0800
Message-ID: <87a5shajm5.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Ryan Roberts writes:

> On 16/10/2023 07:17, Huang, Ying wrote:
>> Ryan Roberts writes:
>>
>>> On 11/10/2023 11:36, Ryan Roberts wrote:
>>>> On 11/10/2023 09:25, Huang, Ying wrote:
>>>>> Ryan Roberts writes:
>>>>>
>>>>>> The upcoming anonymous small-sized THP feature enables performance
>>>>>> improvements by allocating large folios for anonymous memory. However,
>>>>>> I've observed that on an arm64 system running a parallel workload
>>>>>> (e.g. kernel compilation) across many cores, under high memory
>>>>>> pressure, the speed regresses. This is due to bottlenecking on the
>>>>>> increased number of TLBIs caused by all the extra folio splitting.
>>>>>>
>>>>>> Therefore, solve this regression by adding support for swapping out
>>>>>> small-sized THP without needing to split the folio, just as is
>>>>>> already done for PMD-sized THP. This change only applies when
>>>>>> CONFIG_THP_SWAP is enabled, and when the swap backing store is a
>>>>>> non-rotating block device - these are the same constraints as for the
>>>>>> existing PMD-sized THP swap-out support.
>>>>>>
>>>>>> Note that no attempt is made to swap in THP here - this is still done
>>>>>> page-by-page, as for PMD-sized THP.
>>>>>>
>>>>>> The main change here is to improve the swap entry allocator so that
>>>>>> it can allocate any power-of-2 number of contiguous entries between
>>>>>> [4, (1 << PMD_ORDER)]. This is done by allocating a cluster for each
>>>>>> distinct order and allocating sequentially from it until the cluster
>>>>>> is full. This ensures that we don't need to search the map, and we
>>>>>> get no fragmentation due to alignment padding for different orders in
>>>>>> the cluster. If there is no current cluster for a given order, we
>>>>>> attempt to allocate a free cluster from the list. If there are no
>>>>>> free clusters, we fail the allocation and the caller falls back to
>>>>>> splitting the folio and allocating individual entries (as per the
>>>>>> existing PMD-sized THP fallback).
>>>>>>
>>>>>> As far as I can tell, this should not cause any extra fragmentation
>>>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>>>> allocation mechanism. There will be up to (PMD_ORDER - 1) clusters in
>>>>>> concurrent use, though. In practice, the number of orders in use will
>>>>>> be small.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts
>>>>>> ---
>>>>>>  include/linux/swap.h |  7 ++++++
>>>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>>>  mm/vmscan.c          | 10 +++++---
>>>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index a073366a227c..fc55b760aeff 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>>>  	 */
>>>>>>  	struct work_struct discard_work; /* discard worker */
>>>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>>>> +					     * next free offset within current
>>>>>> +					     * allocation cluster for large
>>>>>> +					     * folios, or UINT_MAX if no current
>>>>>> +					     * cluster. Index is (order - 1).
>>>>>> +					     * Only when cluster_info is used.
>>>>>> +					     */
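
To make the scheme described above concrete, here is a rough sketch of
the per-order allocation path, reconstructed from the changelog text
alone - not the actual patch. swap_alloc_large() is the name used later
in this thread; claim_free_cluster() is a hypothetical helper:

  /*
   * Sketch: allocate 1 << order contiguous swap entries, si->lock held.
   * One "current" cluster is kept per order; entries are handed out
   * sequentially until the cluster is exhausted, then a free cluster is
   * claimed.  Returns 0 on failure; the caller then splits the folio
   * and allocates order-0 entries as before.
   */
  static unsigned long swap_alloc_large(struct swap_info_struct *si,
                                        int order)
  {
          unsigned int nr = 1 << order;
          unsigned int *next = &si->large_next[order - 1];
          unsigned long offset;

          if (*next == UINT_MAX) {
                  /* No current cluster for this order: claim a free one. */
                  if (cluster_list_empty(&si->free_clusters))
                          return 0;
                  offset = claim_free_cluster(si);        /* hypothetical */
          } else {
                  offset = *next;
          }

          /* Advance sequentially; mark exhausted at the cluster boundary. */
          *next = offset + nr;
          if (*next % SWAPFILE_CLUSTER == 0)
                  *next = UINT_MAX;

          return offset;  /* first of nr contiguous swap offsets */
  }

Because nr is a power of 2 and the cluster size is a power-of-2 multiple
of it, allocations never straddle a cluster boundary, which is why no
alignment padding or map search is needed.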
>>>>>
>>>>> I think that it is better to make this per-CPU.  That is, extend the
>>>>> percpu_cluster mechanism.  Otherwise, we may have a scalability issue.
>>>>
>>>> Is your concern that the swap_info spinlock will get too contended as
>>>> it's currently written? From briefly looking at percpu_cluster, it
>>>> looks like that spinlock is always held when accessing the per-cpu
>>>> structures - presumably that's what's disabling preemption and making
>>>> sure the thread is not migrated? So I'm not sure what the benefit is
>>>> currently? Surely you want to just disable preemption but not hold the
>>>> lock? I'm sure I've missed something crucial...
>>>
>>> I looked a bit further at how to implement what you are suggesting.
>>> get_swap_pages() currently takes the swap_info lock, which it needs in
>>> order to check and update some other parts of the swap_info - I'm not
>>> sure that part can be removed. swap_alloc_large() (my new function) is
>>> not doing an awful lot of work, so I'm not convinced that you would
>>> save much by releasing the lock for that part. In contrast, there is a
>>> lot more going on in scan_swap_map_slots(), so there is more benefit to
>>> releasing the lock and using the percpu stuff - correct me if I've
>>> misunderstood.
>>>
>>> As an alternative approach, perhaps it makes more sense to beef up the
>>> caching layer in swap_slots.c to handle large folios too? Then you
>>> avoid taking the swap_info lock at all most of the time, as you
>>> currently do for single-entry allocations.
>>>
>>> What do you think?
>>
>> Sorry for the late reply.
>>
>> percpu_cluster was introduced in commit ebc2a1a69111 ("swap: make
>> cluster allocation per-cpu").  Please check that changelog for why it
>> was introduced.  Sorry about my incorrect memory regarding scalability:
>> percpu_cluster was introduced mainly for disk performance, not
>> scalability.
>
> Thanks for the pointer. I'm not sure if you are still suggesting that I
> make my small-sized THP allocation mechanism per-CPU, though?

Yes.  I think that the reason we introduced percpu_cluster still applies
now.
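
For context, the mechanism being discussed is the per-CPU allocation
cache in include/linux/swap.h (as of the kernels around this thread):

  /*
   * We assign a cluster to each CPU, so each CPU can allocate swap entry
   * from its own cluster and swapout sequentially. The purpose is to
   * optimize swapout throughput.
   */
  struct percpu_cluster {
          struct swap_cluster_info index; /* Current cluster index */
          unsigned int next; /* Likely next allocation offset */
  };

Extending it per order, as suggested, might look roughly like this
(purely illustrative; not code from the patch or any tree):

  struct percpu_cluster {
          struct swap_cluster_info index;     /* current order-0 cluster */
          unsigned int next;                  /* next order-0 offset */
          unsigned int large_next[PMD_ORDER]; /*
                                               * per-order next offsets,
                                               * UINT_MAX if no current
                                               * cluster for that order
                                               */
  };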
> I anticipate that by virtue of allocating multiple contiguous swap
> entries for a small-sized THP, we already get a lot of the benefits
> that percpu_cluster gives order-0 allocations. (Although obviously it
> will only give contiguity matching the size of the THP rather than a
> full cluster.)

I think that you will still introduce "interleave disk access" when
multiple CPUs allocate from and write to the swap device simultaneously,
right?  Yes, a 16KB block is better than a 4KB block, but I don't think
it solves the problem.

> The downside of making this percpu would be keeping more free clusters
> tied up in the percpu caches, potentially causing a need to scan for
> free entries more often.

Yes.  We may waste several MB of swap space per CPU.  Is this a
practical issue, given that swap device capacity keeps getting larger?
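
As a rough sense of scale - assuming x86-64 with 4KB pages and a 2MB PMD
size, so PMD_ORDER = 9 and one swap cluster covers SWAPFILE_CLUSTER =
HPAGE_PMD_NR = 512 entries:

  swap space per cluster:   512 entries * 4KB = 2MB
  open per-order clusters:  up to PMD_ORDER - 1 = 8 per CPU
  worst case held per CPU:  8 * 2MB = 16MB

In practice only a few orders will be in use at once, so a few MB per
CPU is the more likely figure.

--
Best Regards,
Huang, Ying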