From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 001F6C4167B
	for <linux-mm@archiver.kernel.org>; Mon, 13 Nov 2023 14:53:17 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 910106B0187; Mon, 13 Nov 2023 09:53:17 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8CAEE6B0189; Mon, 13 Nov 2023 09:53:17 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 788F06B018A; Mon, 13 Nov 2023 09:53:17 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 685346B0187
	for <linux-mm@kvack.org>; Mon, 13 Nov 2023 09:53:17 -0500 (EST)
Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id 483ACA07F6
	for <linux-mm@kvack.org>; Mon, 13 Nov 2023 14:53:17 +0000 (UTC)
X-FDA: 81453224034.29.0D55A0D
Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189])
	by imf09.hostedemail.com (Postfix) with ESMTP id 7AAEF14001F
	for <linux-mm@kvack.org>; Mon, 13 Nov 2023 14:53:13 +0000 (UTC)
Authentication-Results: imf09.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=quarantine) header.from=huawei.com;
	spf=pass (imf09.hostedemail.com: domain of wangkefeng.wang@huawei.com designates 45.249.212.189 as permitted sender) smtp.mailfrom=wangkefeng.wang@huawei.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1699887194;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=fJ4GkJN/BFLIis31dBmcExCPqSk7ozi7zVq4CvihE/o=;
	b=rHw21ydhMQ5ChdbKA/9UAoi2QnCc+qUUrARj5PKXpuawZkwUCdHWSo3AM1MEJztjUr6YGD
	JRYfKhMc//mx1zMzuKZgwbSCawiCX+jBx0XGv81zQtPTgMlA7IamBPg+5gV6y4iUj3Rrca
	xYoKru3Fx7bbYiHjjEOUHJWxB3D7ghw=
ARC-Authentication-Results: i=1;
	imf09.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=quarantine) header.from=huawei.com;
	spf=pass (imf09.hostedemail.com: domain of wangkefeng.wang@huawei.com designates 45.249.212.189 as permitted sender) smtp.mailfrom=wangkefeng.wang@huawei.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1699887194; a=rsa-sha256;
	cv=none;
	b=QpVECGQfsaxDspZcnNnbB0o5fNJLYf500odTgNTykqDcaFymDtIqJ09V64kC2QzOr9nkR4
	t9UE4dkrm08vahoN/ZUmheexHAsmvANrKQgXDRLC+ztHGHd4dm0vcntfrHeArJiiBeZPa6
	Ga7+a09J+05sgitelPLbJ9TN4Xt8hdo=
Received: from dggpemm100001.china.huawei.com (unknown [172.30.72.53])
	by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4STXNZ6dwMzMmnp;
	Mon, 13 Nov 2023 22:48:14 +0800 (CST)
Received: from [10.174.177.243] (10.174.177.243) by
 dggpemm100001.china.huawei.com (7.185.36.93) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2507.31; Mon, 13 Nov 2023 22:52:48 +0800
Message-ID: <712796da-60b2-4a33-8c21-75ab20c609c7@huawei.com>
Date: Mon, 13 Nov 2023 22:52:47 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
Content-Language: en-US
To: Ryan Roberts <ryan.roberts@arm.com>, Matthew Wilcox <willy@infradead.org>,
	John Hubbard <jhubbard@nvidia.com>
CC: Andrew Morton <akpm@linux-foundation.org>, Yin Fengwei
	<fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao
	<yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Anshuman
 Khandual <anshuman.khandual@arm.com>, Yang Shi <shy828301@gmail.com>, "Huang,
 Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>, Luis Chamberlain
	<mcgrof@kernel.org>, Itaru Kitayama <itaru.kitayama@gmail.com>, "Kirill A.
 Shutemov" <kirill.shutemov@linux.intel.com>, David Rientjes
	<rientjes@google.com>, Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins
	<hughd@google.com>, <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>
 <c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com>
 <ZVGxkMeY50JSesaj@casper.infradead.org>
 <f1fa098b-210e-41a9-80fc-aec212976610@arm.com>
 <479b3e2b-456d-46c1-9677-38f6c95a0be8@huawei.com>
 <f034dd2c-4ce1-47e5-a3a6-c3c1fcab5c4b@arm.com>
From: Kefeng Wang <wangkefeng.wang@huawei.com>
In-Reply-To: <f034dd2c-4ce1-47e5-a3a6-c3c1fcab5c4b@arm.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.174.177.243]
X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To
 dggpemm100001.china.huawei.com (7.185.36.93)
X-CFilter-Loop: Reflected
X-Rspam-User: 
X-Stat-Signature: rjxjr8qj87wfyxpdm4gr5ras8mpmbnro
X-Rspamd-Server: rspam07
X-Rspamd-Queue-Id: 7AAEF14001F
X-HE-Tag: 1699887193-327903
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>


On 2023/11/13 20:12, Ryan Roberts wrote:
> On 13/11/2023 11:52, Kefeng Wang wrote:
>>
>>
>> On 2023/11/13 18:19, Ryan Roberts wrote:
>>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>>> I've done some initial performance testing of this patchset on an arm64
>>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>>> by different engineers and on different machines, have convinced me and
>>>>> my colleagues that this is an accurate result.
>>>>>
>>>>> In order to achieve that result, we used the git tree in [1] with
>>>>> following settings:
>>>>>
>>>>>       echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>>       echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>>
>>>>> This was on an aarch64 machine configured to use a 64KB base page size.
>>>>> That configuration means that the PMD size is 512MB, which is of course
>>>>> too large for practical use as a pure PMD-THP. However, with these
>>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>>> coverage, while still getting pages that are small enough to be
>>>>> effectively usable.
>>>>
>>>> That is quite remarkable!
>>>
>>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>>
>>>>
>>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>>> using the mixture of page sizes that you currently are -- 64k and
>>>> 1M (right?  Order-0, and order-4)
>>>
>>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>>> intuitively you would expect the order to remain constant, but it doesn't).
>>>
>>> The "recommend" setting above will actually enable order-3 as well even though
>>> there is no HW benefit to this. So the full set of available memory sizes here
>>> is:
>>>
>>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>>
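
For readers following the order/size arithmetic above, here is a minimal
sketch (not kernel code). It assumes a folio of order N covers
base_page << N bytes, and that arm64 page table entries are 8 bytes with
one table occupying one base page, so the PMD order is log2(page_size / 8).
The function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
PTE_BYTES = 8  # assumption: 8-byte arm64 page table descriptors

def folio_size(base_page_bytes, order):
    """Bytes covered by a folio of the given order."""
    return base_page_bytes << order

def pmd_order(base_page_bytes):
    """Order of a PMD-sized block, assuming one table fills one base page."""
    entries_per_table = base_page_bytes // PTE_BYTES
    return entries_per_table.bit_length() - 1  # log2 of a power of two

def human(nbytes):
    """Format a byte count as whole M or K where possible."""
    for unit, scale in (("M", MIB), ("K", KIB)):
        if nbytes >= scale and nbytes % scale == 0:
            return f"{nbytes // scale}{unit}"
    return f"{nbytes}B"

# contpte orders as quoted in this thread, keyed by base page size
CONTPTE_ORDER = {4 * KIB: 4, 16 * KIB: 7, 64 * KIB: 5}

for base, order in CONTPTE_ORDER.items():
    print(f"base {human(base)}: contpte {human(folio_size(base, order))}"
          f"/order-{order}, PMD {human(folio_size(base, pmd_order(base)))}"
          f"/order-{pmd_order(base)}")
```

With a 64K base page this reproduces the 2M/order-5 contpte size and the
512M/order-13 PMD size listed above.
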
>>>> , that 4k, 64k and 2MB (order-0,
>>>> order-4 and order-9) will provide better performance.
>>>>
>>>> Have you run any experiments with a 4kB page size?
>>>
>>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>>> to get to a world where we universally deal in variable sized chunks of memory,
>>> aligned on 4K boundaries.
>>>
>>> In my experience though, there are still some performance benefits to 64K base
>>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>>> taking up a full cache line vs the former where a single cache line stores 8x
>>> 64K entries.
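
The page table footprint comparison above can be checked with a small
sketch. It assumes 8-byte PTEs and a 64-byte cache line (the line size is
an assumption, implied by "a single cache line stores 8x 64K entries");
the helper names are hypothetical.

```python
KIB = 1024
PTE_BYTES = 8          # assumption: 8-byte page table entries
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines

def pte_bytes_for(span_bytes, base_page_bytes):
    """Bytes of last-level page table needed to map span_bytes."""
    return (span_bytes // base_page_bytes) * PTE_BYTES

def span_per_cache_line(base_page_bytes):
    """Memory mapped by the PTEs that fit in one cache line."""
    return (CACHE_LINE_BYTES // PTE_BYTES) * base_page_bytes

span = 64 * KIB
print("64K base page:", pte_bytes_for(span, 64 * KIB), "bytes of PTEs")
print("4K base page: ", pte_bytes_for(span, 4 * KIB), "bytes of PTEs")
print("one cache line maps",
      span_per_cache_line(64 * KIB) // KIB, "K with 64K pages vs",
      span_per_cache_line(4 * KIB) // KIB, "K with 4K pages")
```

So one cache line of PTEs describes 512K with a 64K base page, but only
32K with a 4K base page, whether or not those entries are contpte-mapped.
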
>>
>> We tested some benchmarks (e.g. unixbench, lmbench, sysbench) with v5 on
>> an arm64 board. To better evaluate anon large folios we used ext4, which
>> doesn't support large folios for now. We will test again and send
>> the results once v7 is out.
> 
> Thanks for the testing and for posting the insights!
> 
>>
>> 1) base page 4k  + without anon large folio
>> 2) base page 64k + without anon large folio
>> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
>>
>> Most of the test results from v5 show that 3) is a good improvement
>> over 1), but still lower than 2)
> 
> Do you have any understanding what the shortfall is for these particular
> workloads? Certainly the cache spatial locality benefit of the 64K page tables
> could be a factor. But certainly for the workloads I've been looking at, a
> bigger factor is often the fact that executable file-backed memory (elf
> segments) are not in 64K folios and therefore not contpte-mapped. If the iTLB is
> under pressure this can help a lot. I have a change (hack) to force all
> executable mappings to be read-ahead into 64K folios and this gives an
> improvement. But obviously that only works when the file system supports large
> folios (so not ext4 right now). It would certainly be interesting to see just
> how close to native 64K we can get when employing these extra ideas.

No detailed analysis yet, but with a 64K base page there are:
  fewer page faults
  fewer TLB operations
  less zone-lock contention (pcp)
  fewer buddy splits/merges
  no reclaim/compaction when allocating a 64K page, and no fallback logic
  64K folios for executable mappings (exec folio)
  faster page table operations?
  ...

> 
>> , also for some latency-sensitive
>> benchmarks, 2) and 3) may perform worse than 1).
>>
>> Note that pcp_allowed_order limits the order to <= PAGE_ALLOC_COSTLY_ORDER (3).
>> For 3), we may need to enlarge it for better page allocation scalability
>> on arm64. This was not tested on v5; we will try enlarging it on v7.
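
To illustrate the pcp point, here is a simplified Python model of the
check, not the kernel's C code. It assumes the per-cpu page lists accept
orders up to PAGE_ALLOC_COSTLY_ORDER plus one THP-sized order; the
thp_order and max_pcp_order parameters are hypothetical knobs for this
sketch only.

```python
PAGE_ALLOC_COSTLY_ORDER = 3

def pcp_allowed_order(order, thp_order=9,
                      max_pcp_order=PAGE_ALLOC_COSTLY_ORDER):
    """Simplified model: may this order be served from per-cpu lists?"""
    return order <= max_pcp_order or order == thp_order

# With a 4K base page, a 64K contpte-sized folio is order-4.
print("order-4 on pcp today:        ", pcp_allowed_order(4))
print("order-4 with enlarged limit: ", pcp_allowed_order(4, max_pcp_order=4))
```

In this model an order-4 allocation bypasses the per-cpu lists and takes
the zone lock, which is the scalability concern for case 3).
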
> 
> Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
> to be rebased on v6.7-rc1. I'd be interested to see your results.
> 
Glad to see it.

>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>
> 
>