From: Ryan Roberts <ryan.roberts@arm.com>
Date: Wed, 13 Mar 2024 08:50:51 +0000
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting
To: "Huang, Ying"
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
 Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>,
 Chris Li, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <3d60e840-00e1-4e6e-a9f4-e67d905b1782@arm.com>
In-Reply-To: <87zfv32aq7.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
 <878r2n516c.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <28914585-80bd-4308-b3aa-dd0dbb2cb201@arm.com>
 <2fbc83bf-2e51-40fa-8865-499911ba8102@arm.com>
 <87zfv32aq7.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 13/03/2024 01:15, Huang, Ying wrote:
> Ryan Roberts writes:
>
>> On 12/03/2024 08:49, Ryan Roberts wrote:
>>> On 12/03/2024 08:01, Huang, Ying wrote:
>>>> Ryan Roberts writes:
>>>>
>>>>> Hi All,
>>>>>
>>>>> This series adds support for swapping out multi-size THP (mTHP) without
>>>>> needing to first split the large folio via
>>>>> split_huge_page_to_list_to_order(). It closely follows the approach
>>>>> already used to swap out PMD-sized THP.
>>>>>
>>>>> There are a few reasons for swapping out mTHP without splitting:
>>>>>
>>>>>   - Performance: It is expensive to split a large folio, and under
>>>>>     extreme memory pressure some workloads regressed performance when
>>>>>     using 64K mTHP vs 4K small folios because of this extra cost in the
>>>>>     swap-out path. This series not only eliminates the regression but
>>>>>     makes it faster to swap out 64K mTHP vs 4K small folios.
>>>>>
>>>>>   - Memory fragmentation avoidance: If we can avoid splitting a large
>>>>>     folio, memory is less likely to become fragmented, making it easier
>>>>>     to re-allocate a large folio in future.
>>>>>
>>>>>   - Performance: Enables a separate series [4] to swap in whole mTHPs,
>>>>>     which means we won't lose the TLB-efficiency benefits of mTHP once
>>>>>     the memory has been through a swap cycle.
>>>>>
>>>>> I've done what I thought was the smallest change possible, and as a
>>>>> result, this approach is only employed when the swap is backed by a
>>>>> non-rotating block device (just as PMD-sized THP is supported today).
>>>>> Discussion against the RFC concluded that this is sufficient.
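
(As an illustration of the above policy -- a rough sketch only, not code from
this series; swap_alloc_nr_contig() is a made-up stand-in helper:)

/*
 * Sketch: whole-folio swap-out is attempted only when the swap device
 * is non-rotating, mirroring the existing PMD-sized THP rule; if
 * contiguous swap entries can't be allocated, the caller falls back
 * to splitting the folio as before.
 */
static bool try_swap_out_whole_folio(struct swap_info_struct *si,
				     struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);

	if (!(si->flags & SWP_SOLIDSTATE))
		return false;

	/*
	 * Need nr_pages contiguous swap entries so the folio can be
	 * written out without first calling
	 * split_huge_page_to_list_to_order().
	 */
	return swap_alloc_nr_contig(si, nr_pages);
}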

>>>>> Performance Testing
>>>>> ===================
>>>>>
>>>>> I've run some swap performance tests on an Ampere Altra VM (arm64) with
>>>>> 8 CPUs. The VM is set up with a 35G block ram device as the swap device
>>>>> and the test is run from inside a memcg limited to 40G memory. I've then
>>>>> run `usemem` from vm-scalability with 70 processes, each allocating and
>>>>> writing 1G of memory. I've repeated everything 6 times and taken the
>>>>> mean performance improvement relative to the 4K page baseline:
>>>>>
>>>>> | alloc size |           baseline | + this series |
>>>>> |            | v6.6-rc4+anonfolio |               |
>>>>> |:-----------|-------------------:|--------------:|
>>>>> | 4K Page    |               0.0% |          1.4% |
>>>>> | 64K THP    |             -14.6% |         44.2% |
>>>>> | 2M THP     |              87.4% |         97.7% |
>>>>>
>>>>> So with this change, the 64K swap performance goes from a 15% regression
>>>>> to a 44% improvement. 4K and 2M swap improve slightly too.
>>>>
>>>> I don't understand why the performance of 2M THP improves. The swap
>>>> entry allocation becomes a little slower. Can you provide some
>>>> perf-profile to root cause it?
>>>
>>> I didn't post the stdev, which is quite large (~10%), so that may explain
>>> some of it:
>>>
>>> | kernel   | mean_rel | std_rel |
>>> |:---------|---------:|--------:|
>>> | base-4K  |     0.0% |    5.5% |
>>> | base-64K |   -14.6% |    3.8% |
>>> | base-2M  |    87.4% |   10.6% |
>>> | v4-4K    |     1.4% |    3.7% |
>>> | v4-64K   |    44.2% |   11.8% |
>>> | v4-2M    |    97.7% |   13.3% |
>>>
>>> Regardless, I'll do some perf profiling and post results shortly.
>>
>> I did a lot more runs (24 for each config) and averaged them to try to
>> remove the noise in the measurements. It's now only showing a 4%
>> improvement for 2M, so I don't think the 2M improvement is real:
>>
>> | kernel   | mean_rel | std_rel |
>> |:---------|---------:|--------:|
>> | base-4K  |     0.0% |    3.2% |
>> | base-64K |    -9.1% |   10.1% |
>> | base-2M  |    88.9% |    6.8% |
>> | v4-4K    |     0.5% |    3.1% |
>> | v4-64K   |    44.7% |    8.3% |
>> | v4-2M    |    93.3% |    7.8% |
>>
>> Looking at the perf data, the only thing that sticks out is that a big
>> chunk of time is spent in contpte_convert(), called as a result of
>> try_to_unmap_one(). This is present in both the before and after configs.
>>
>> This is an arm64 function to "unfold" contpte mappings. Essentially, the
>> PMD is being split during shrink_folio_list() with TTU_SPLIT_HUGE_PMD,
>> meaning the THPs are PTE-mapped in contpte blocks. Then we are unmapping
>> each pte one by one, which means the contpte block needs to be unfolded. I
>> think try_to_unmap_one() could potentially be optimized to batch-unmap a
>> contiguously mapped folio and avoid this unfold. But that would be an
>> independent and separate piece of work.
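
(Very roughly, that batching idea might look like the sketch below. This is
illustrative only -- nothing I've implemented or measured.
ptep_get_and_clear_nr() is a made-up batch helper that an arch could wire up
to tear down a whole contpte block in one go instead of unfolding it:)

/*
 * Sketch: clear all PTEs spanning a fully, contiguously mapped folio
 * with a single batched call, rather than one PTE at a time, so the
 * arm64 contpte code never has to unfold the block via
 * contpte_convert(). ptep_get_and_clear_nr() is hypothetical.
 */
static void unmap_folio_pte_batch(struct vm_area_struct *vma,
				  unsigned long addr, pte_t *ptep,
				  struct folio *folio)
{
	unsigned int nr = folio_nr_pages(folio);

	/* One call covering the folio's entire PTE range. */
	ptep_get_and_clear_nr(vma->vm_mm, addr, ptep, nr);
}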

> 
> Thanks for more data and detailed explanation.

And thanks for your review! I'll address all your comments (and any others
that I get in the meantime) and repost after the merge window. It would be
great if we can get this in for v6.10.

> 
> --
> Best Regards,
> Huang, Ying