From: Ryan Roberts
Date: Wed, 26 Jun 2024 11:47:10 +0100
Subject: Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64
To: "Christoph Lameter (Ampere)"
Cc: Yang Shi, Jonathan Cameron, lsf-pc@lists.linux-foundation.org, olivier.singla@amperecomputing.com, Linux MM, Michal Hocko, Dan Williams, Matthew Wilcox, Zi Yan
References: <20240401191614.00007c83@Huawei.com> <145031ae-1d4d-4b43-b2c9-aed0d10e86ca@arm.com> <7a8bcd48-47b4-4bc7-a38f-45cef9adc221@arm.com>

On 25/06/2024 19:11, Christoph Lameter (Ampere) wrote:
> On Tue, 25 Jun 2024, Ryan Roberts wrote:
>
>> But I also want to raise a more general point; We are not done with the
>> optimizations yet. contpte can also improve performance for iTLB, but this
>> requires a change to the page cache to store text in (at least) 64K folios.
>> Typically the iTLB is under a lot of pressure and this can help reduce it.
>> This change is not in mainline yet (and I still need to figure out how to
>> make the patch acceptable), but is worth another ~1.5% for the 4KPS case. I
>> suspect this will also move the needle on the other benchmarks you ran. See
>> [3] - I'd appreciate any thoughts you have on how to get something like this
>> accepted.
>>
>> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@arm.com/
>
> The discussion here seems to indicate that readahead is already ok for order-2
> (16K mTHP size?). So this is only for 64K mTHP on 4K?

Kind of; for filesystems that report support for large folios, readahead starts
with an order-2 folio, then increments the folio order by 2 for every
subsequent readahead marker that is hit. But text is rarely accessed
sequentially, so readahead markers are rarely hit in practice and the text
folios tend to end up as order-2 (16K for 4K base pages). The important bit,
though, is that the filesystem needs to support large folios in the first
place; without that, we are always stuck using small (order-0) folios. XFS and
a few other (network) filesystems support large folios today, but ext4 doesn't
- that's being worked on though.

>
> From what I read in the ARM64 manuals it seems that CONT_PTE can only be used
> for 64K mTHP on 4K kernels. The 16K case will not benefit from CONT_PTE nor any
> other intermediate size than 64K.

Yes and no. The contiguous hint, when applied, covers a single fixed size, and
that size depends on the base page size. It's 64K for 4KPS, 2M for 16KPS and 2M
for 64KPS. However, most modern Arm-designed CPUs support a micro-architectural
feature called Hardware Page Aggregation (HPA), which can aggregate up to 4
pages into a single TLB entry in a way that is transparent to SW. So that
feature can benefit from 16K folios when using 4K base pages.
Although HPA is implemented in the Neoverse N1 CPU (which I believe is what is
in the Ampere Altra), it is disabled there and, due to an erratum, can't be
enabled. So HPA is not relevant for Altra.

>
> Quoting:
>
> https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-/Memory-region-attributes/Long-descriptor-format-memory-region-attributes?lang=en#BEIIBEIJ

Note this link is for Armv7-A, not v8. But hopefully my explanation above
answers everything.

Thanks,
Ryan

>
> "Contiguous hint
>
> The Long-descriptor translation table format descriptors contain a Contiguous
> hint bit. Setting this bit to 1 indicates that 16 adjacent translation table
> entries point to a contiguous output address range. These 16 entries must be
> aligned in the translation table so that the top 5 bits of their input
> addresses, that index their position in the translation table, are the same. For
> example, referring to Figure 12.21, to use this hint for a block of 16 entries
> in the third-level translation table, bits[20:16] of the input addresses for the
> 16 entries must be the same.
>
> The contiguous output address range must be aligned to size of 16 translation
> table entries at the same translation table level.
>
> Use of this hint means that the TLB can cache a single entry to cover the 16
> translation table entries.
>
> This bit is only a hint bit. The architecture does not require a processor to
> cache TLB entries in this way. To avoid TLB coherency issues, any TLB
> maintenance by address must not assume any optimization of the TLB tables that
> might result from use of the hint bit.
>
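
For reference, the size arithmetic in the reply above (order-2 folios being 16K on 4K base pages, and the contiguous hint covering 64K, 2M and 2M for 4K, 16K and 64K base pages) can be sketched in a few lines of Python. This is only an illustration, not kernel code: the function names are hypothetical, and the per-block entry counts (16, 128 and 32 PTEs) are an assumption chosen to match the sizes quoted in the reply. The alignment check mirrors the quoted manual text, which requires both the input and the output address ranges to be aligned to the block size.

```python
# Illustrative arithmetic only; all names here are hypothetical, not kernel APIs.

KIB = 1024
MIB = 1024 * KIB

# Assumed number of PTEs covered by one contiguous-hint block, keyed by base
# page size. Chosen to match the 64K / 2M / 2M figures given in the reply.
CONT_PTES = {4 * KIB: 16, 16 * KIB: 128, 64 * KIB: 32}


def folio_size(order: int, base_page_size: int) -> int:
    """Size in bytes of an order-N folio: base_page_size * 2**order."""
    return base_page_size << order


def contpte_span(base_page_size: int) -> int:
    """Bytes covered by one contiguous-hint block at the PTE level."""
    return base_page_size * CONT_PTES[base_page_size]


def min_contpte_order(base_page_size: int) -> int:
    """Smallest folio order whose size equals the contiguous-hint span."""
    # The entry counts are powers of two, so log2 is bit_length() - 1.
    return CONT_PTES[base_page_size].bit_length() - 1


def can_use_contpte(vaddr: int, paddr: int, length: int, base_page_size: int) -> bool:
    """True if a mapping covers at least one naturally aligned block.

    Both the virtual (input) and physical (output) addresses must be aligned
    to the block size, and the mapping must be at least one block long.
    """
    span = contpte_span(base_page_size)
    return vaddr % span == 0 and paddr % span == 0 and length >= span


if __name__ == "__main__":
    for page_size in sorted(CONT_PTES):
        print(
            f"{page_size // KIB}K pages: order-2 folio = "
            f"{folio_size(2, page_size) // KIB}K, contpte span = "
            f"{contpte_span(page_size) // KIB}K "
            f"(order {min_contpte_order(page_size)})"
        )
```

With 4K base pages, an order-2 (16K) text folio is a quarter of the 64K contiguous-hint span, which is why it cannot use the hint on its own but could still benefit from a feature that aggregates up to 4 pages.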