From: Kefeng Wang
Date: Mon, 13 Nov 2023 22:52:47 +0800
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
To: Ryan Roberts, Matthew Wilcox, John Hubbard
CC: Andrew Morton, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas,
    Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan, Luis Chamberlain,
    Itaru Kitayama, "Kirill A. Shutemov", David Rientjes, Vlastimil Babka,
    Hugh Dickins
Message-ID: <712796da-60b2-4a33-8c21-75ab20c609c7@huawei.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>
    <479b3e2b-456d-46c1-9677-38f6c95a0be8@huawei.com>

On 2023/11/13 20:12, Ryan Roberts wrote:
> On 13/11/2023 11:52, Kefeng Wang wrote:
>>
>>
>> On 2023/11/13 18:19, Ryan Roberts wrote:
>>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>>> I've done some initial performance testing of this patchset on an arm64
>>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>>> by different engineers and on different machines, have convinced me and
>>>>> my colleagues that this is an accurate result.
>>>>>
>>>>> In order to achieve that result, we used the git tree in [1] with the
>>>>> following settings:
>>>>>
>>>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>>
>>>>> This was on an aarch64 machine configured to use a 64KB base page size.
>>>>> That configuration means that the PMD size is 512MB, which is of course
>>>>> too large for practical use as a pure PMD-THP. However, with these
>>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>>> coverage, while still getting pages that are small enough to be
>>>>> effectively usable.
>>>>
>>>> That is quite remarkable!
>>>
>>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>>
>>>>
>>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>>> using the mixture of page sizes that you currently are -- 64k and
>>>> 1M (right?  Order-0, and order-4)
>>>
>>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size.
>>> I agree that intuitively you would expect the order to remain constant, but it
>>> doesn't).
>>>
>>> The "recommend" setting above will actually enable order-3 as well even though
>>> there is no HW benefit to this. So the full set of available memory sizes here
>>> is:
>>>
>>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>>
>>>> , that 4k, 64k and 2MB (order-0,
>>>> order-4 and order-9) will provide better performance.
>>>>
>>>> Have you run any experiments with a 4kB page size?
>>>
>>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>>> to get to a world where we universally deal in variable sized chunks of memory,
>>> aligned on 4K boundaries.
>>>
>>> In my experience though, there are still some performance benefits to 64K base
>>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>>> taking up a full cache line vs the former where a single cache line stores 8x
>>> 64K entries.
>>
>> We tested some benchmarks, e.g. unixbench, lmbench, sysbench, with v5 on an
>> arm64 board (for better evaluation of anon large folios, using ext4,
>> which doesn't support large folios for now), will test again and send
>> the results once v7 is out.
>
> Thanks for the testing and for posting the insights!
>
>>
>> 1) base page 4k  + without anon large folio
>> 2) base page 64k + without anon large folio
>> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
>>
>> Most of the test results from v5 show that 3) has a good improvement
>> vs 1), but is still lower than 2)
>
> Do you have any understanding what the shortfall is for these particular
> workloads? Certainly the cache spatial locality benefit of the 64K page tables
> could be a factor.
> But certainly for the workloads I've been looking at, a
> bigger factor is often the fact that executable file-backed memory (elf
> segments) are not in 64K folios and therefore not contpte-mapped. If the iTLB is
> under pressure this can help a lot. I have a change (hack) to force all
> executable mappings to be read-ahead into 64K folios and this gives an
> improvement. But obviously that only works when the file system supports large
> folios (so not ext4 right now). It would certainly be interesting to see just
> how close to native 64K we can get when employing these extra ideas.

No detailed analysis, but with base page 64k,

fewer page faults
fewer TLB operations
less zone-lock congestion (pcp)
fewer buddy splits/merges
no reclaim/compaction when allocating a 64k page, and no fallback logic
execfolio
faster page table operations?
...

>
>> , also for some latency-sensitive
>> benchmarks, 2) and 3) may have poor performance vs 1).
>>
>> Note, for pcp_allowed_order, order <= PAGE_ALLOC_COSTLY_ORDER=3, so for
>> 3), we may enlarge it for better scalability of page allocation
>> on arm64; not tested on v5, will try to enlarge it on v7.
>
> Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
> to be rebased on v6.7-rc1. I'd be interested to see your results.
>

Glad to see it.

>>
>>>
>>> Thanks,
>>> Ryan
>>>
>
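
The size/order figures quoted in this thread can be cross-checked with a few lines of arithmetic. The Python sketch below is illustrative only: the helper names (folio_size, pte_bytes_for, human) are hypothetical and are not kernel interfaces, and it assumes 8-byte page-table entries, as in the "8 bytes" vs "8x16=128 bytes" comparison above.

```python
# Illustrative arithmetic only. The helper names are hypothetical and are
# not kernel APIs; PTE_BYTES = 8 is taken from the discussion in the thread.

KIB = 1024
PTE_BYTES = 8


def folio_size(base_page_bytes: int, order: int) -> int:
    """Size in bytes of an order-N folio for a given base page size."""
    return base_page_bytes << order


def pte_bytes_for(span_bytes: int, base_page_bytes: int) -> int:
    """Bytes of page-table entries needed to map span_bytes with base pages."""
    return (span_bytes // base_page_bytes) * PTE_BYTES


def human(n: int) -> str:
    """Format a byte count as e.g. '64K' or '2M' when it divides evenly."""
    for unit, shift in (("M", 20), ("K", 10)):
        if n >= (1 << shift) and n % (1 << shift) == 0:
            return f"{n >> shift}{unit}"
    return f"{n}B"


# contpte block order per base page size, as stated in the thread
CONTPTE_ORDER = {4 * KIB: 4, 16 * KIB: 7, 64 * KIB: 5}

for base, order in CONTPTE_ORDER.items():
    size = folio_size(base, order)
    print(f"{human(base)} base page: contpte block = {human(size)}/order-{order}")

# Orders available with a 64K base page under the "recommend" setting
for order in (0, 3, 5, 13):
    print(f"{human(folio_size(64 * KIB, order))}/order-{order}")

# Page-table bytes describing 64K of memory: 64K base page vs 4K base page
print(pte_bytes_for(64 * KIB, 64 * KIB), "bytes vs",
      pte_bytes_for(64 * KIB, 4 * KIB), "bytes")
```

Running it reproduces the 64K/order-4, 2M/order-7 and 2M/order-5 contpte sizes, the 64K/512K/2M/512M set for a 64K base page, and the 8 vs 128 byte page-table footprint.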