From: "Huang, Ying" <ying.huang@intel.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
Date: Mon, 16 Oct 2023 14:17:43 +0800
Message-ID: <87ttqrm6pk.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <1ccde143-18a7-483b-a9a4-fff07b0edc72@arm.com> (Ryan Roberts's message of "Wed, 11 Oct 2023 18:14:37 +0100")
References: <20231010142111.3997780-1-ryan.roberts@arm.com>
	<20231010142111.3997780-3-ryan.roberts@arm.com>
	<87r0m1ftvu.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<1ccde143-18a7-483b-a9a4-fff07b0edc72@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Ryan Roberts writes:

> On 11/10/2023 11:36, Ryan Roberts wrote:
>> On 11/10/2023 09:25, Huang, Ying wrote:
>>> Ryan Roberts writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However,
>>>> I've observed that on an arm64 system running a parallel workload
>>>> (e.g. kernel compilation) across many cores, under high memory
>>>> pressure, the speed regresses. This is due to bottlenecking on the
>>>> increased number of TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is
>>>> already done for PMD-sized THP. This change only applies when
>>>> CONFIG_THP_SWAP is enabled, and when the swap backing store is a
>>>> non-rotating block device - these are the same constraints as for the
>>>> existing PMD-sized THP swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that
>>>> it can allocate any power-of-2 number of contiguous entries between
>>>> [4, (1 << PMD_ORDER)]. This is done by allocating a cluster for each
>>>> distinct order and allocating sequentially from it until the cluster
>>>> is full. This ensures that we don't need to search the map and we get
>>>> no fragmentation due to alignment padding for different orders in the
>>>> cluster.
>>>> If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters,
>>>> we fail the allocation and the caller falls back to splitting the
>>>> folio and allocating individual entries (as per the existing
>>>> PMD-sized THP fallback).
>>>>
>>>> As far as I can tell, this should not cause any extra fragmentation
>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>> concurrent use, though. In practice, the number of orders in use will
>>>> be small.
>>>>
>>>> Signed-off-by: Ryan Roberts
>>>> ---
>>>>  include/linux/swap.h |  7 ++++++
>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>  mm/vmscan.c          | 10 +++++---
>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..fc55b760aeff 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>  	 */
>>>>  	struct work_struct discard_work; /* discard worker */
>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>> +					     * next free offset within current
>>>> +					     * allocation cluster for large
>>>> +					     * folios, or UINT_MAX if no current
>>>> +					     * cluster. Index is (order - 1).
>>>> +					     * Only when cluster_info is used.
>>>> +					     */
>>>
>>> I think that it is better to make this per-CPU. That is, extend the
>>> percpu_cluster mechanism. Otherwise, we may have scalability issue.
>>
>> Is your concern that the swap_info spinlock will get too contended as
>> it's currently written? From briefly looking at percpu_cluster, it looks
>> like that spinlock is always held when accessing the per-cpu structures
>> - presumably that's what's disabling preemption and making sure the
>> thread is not migrated?
>> So I'm not sure what the benefit is currently? Surely you want to just
>> disable preemption but not hold the lock? I'm sure I've missed something
>> crucial...
>
> I looked a bit further at how to implement what you are suggesting.
> get_swap_pages() currently takes the swap_info lock, which it needs in
> order to check and update some other parts of the swap_info - I'm not
> sure that part can be removed. swap_alloc_large() (my new function) is
> not doing an awful lot of work, so I'm not convinced that you would save
> much by releasing the lock for that part. In contrast, there is a lot
> more going on in scan_swap_map_slots(), so there is more benefit to
> releasing the lock and using the percpu stuff - correct me if I've
> misunderstood.
>
> As an alternative approach, perhaps it makes more sense to beef up the
> caching layer in swap_slots.c to handle large folios too? Then you avoid
> taking the swap_info lock at all most of the time, like you currently do
> for single-entry allocations.
>
> What do you think?

Sorry for the late reply.

percpu_cluster was introduced in commit ebc2a1a69111 ("swap: make cluster
allocation per-cpu"). Please check the changelog for why it was
introduced. Sorry about my incorrect memory regarding scalability:
percpu_cluster was introduced mainly for disk performance rather than
scalability.

--
Best Regards,
Huang, Ying

>>
>>> And this should be enclosed in CONFIG_THP_SWAP.
>>
>> Yes, I'll fix this in the next version.
>>
>> Thanks for the review!
>>
>>>> 	struct plist_node avail_lists[]; /*
>>>> 					  * entries in swap_avail_heads, one
>>>> 					  * entry per node.