From: "Huang, Ying" <ying.huang@intel.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
Date: Mon, 16 Oct 2023 14:17:43 +0800
Message-ID: <87ttqrm6pk.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <1ccde143-18a7-483b-a9a4-fff07b0edc72@arm.com> (Ryan Roberts's message of "Wed, 11 Oct 2023 18:14:37 +0100")
References: <20231010142111.3997780-1-ryan.roberts@arm.com>
	<20231010142111.3997780-3-ryan.roberts@arm.com>
	<87r0m1ftvu.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<1ccde143-18a7-483b-a9a4-fff07b0edc72@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Ryan Roberts writes:

> On 11/10/2023 11:36, Ryan Roberts wrote:
>> On 11/10/2023 09:25, Huang, Ying wrote:
>>> Ryan Roberts writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However,
>>>> I've observed that on an arm64 system running a parallel workload
>>>> (e.g. kernel compilation) across many cores, under high memory
>>>> pressure, the speed regresses. This is due to bottlenecking on the
>>>> increased number of TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is
>>>> already done for PMD-sized THP. This change only applies when
>>>> CONFIG_THP_SWAP is enabled, and when the swap backing store is a
>>>> non-rotating block device - these are the same constraints as for the
>>>> existing PMD-sized THP swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that
>>>> it can allocate any power-of-2 number of contiguous entries between
>>>> [4, (1 << PMD_ORDER)]. This is done by allocating a cluster for each
>>>> distinct order and allocating sequentially from it until the cluster
>>>> is full. This ensures that we don't need to search the map and we get
>>>> no fragmentation due to alignment padding for different orders in the
>>>> cluster.
>>>> If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters,
>>>> we fail the allocation and the caller falls back to splitting the
>>>> folio and allocating individual entries (as per the existing
>>>> PMD-sized THP fallback).
>>>>
>>>> As far as I can tell, this should not cause any extra fragmentation
>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>> concurrent use, though. In practice, the number of orders in use will
>>>> be small.
>>>>
>>>> Signed-off-by: Ryan Roberts
>>>> ---
>>>>  include/linux/swap.h |  7 ++++++
>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>  mm/vmscan.c          | 10 +++++---
>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..fc55b760aeff 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>  	 */
>>>>  	struct work_struct discard_work; /* discard worker */
>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>> +					     * next free offset within current
>>>> +					     * allocation cluster for large
>>>> +					     * folios, or UINT_MAX if no current
>>>> +					     * cluster. Index is (order - 1).
>>>> +					     * Only when cluster_info is used.
>>>> +					     */
>>>
>>> I think that it is better to make this per-CPU. That is, extend the
>>> percpu_cluster mechanism. Otherwise, we may have scalability issue.
>>
>> Is your concern that the swap_info spinlock will get too contended as
>> it's currently written? From briefly looking at percpu_cluster, it looks
>> like that spinlock is always held when accessing the per-cpu structures
>> - presumably that's what's disabling preemption and making sure the
>> thread is not migrated?
>> So I'm not sure what the benefit is currently? Surely you want to just
>> disable preemption but not hold the lock? I'm sure I've missed something
>> crucial...
>
> I looked a bit further at how to implement what you are suggesting.
> get_swap_pages() currently takes the swap_info lock, which it needs in
> order to check and update some other parts of the swap_info - I'm not
> sure that part can be removed. swap_alloc_large() (my new function) is
> not doing an awful lot of work, so I'm not convinced that you would save
> much by releasing the lock for that part. In contrast, there is a lot
> more going on in scan_swap_map_slots(), so there is more benefit to
> releasing the lock and using the percpu stuff - correct me if I've
> misunderstood.
>
> As an alternative approach, perhaps it makes more sense to beef up the
> caching layer in swap_slots.c to handle large folios too? Then you avoid
> taking the swap_info lock at all most of the time, like you currently do
> for single-entry allocations.
>
> What do you think?

Sorry for the late reply.

percpu_cluster was introduced in commit ebc2a1a69111 ("swap: make cluster
allocation per-cpu"). Please check the changelog for why it was
introduced. Sorry about my incorrect memory regarding scalability:
percpu_cluster was introduced mainly for disk performance rather than
scalability.

--
Best Regards,
Huang, Ying

>>
>>> And this should be enclosed in CONFIG_THP_SWAP.
>>
>> Yes, I'll fix this in the next version.
>>
>> Thanks for the review!
>>
>>>> 	struct plist_node avail_lists[]; /*
>>>> 					  * entries in swap_avail_heads, one
>>>> 					  * entry per node.