From: Ryan Roberts
Date: Thu, 19 Sep 2024 19:21:27 +0200
Subject: Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory
To: Barry Song
Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet, "Matthew Wilcox (Oracle)", David Hildenbrand, Lance Yang, Baolin Wang, Gavin Shan, Pankaj Raghav, Daniel Gomez, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240717071257.4141363-1-ryan.roberts@arm.com> <480f34d0-a943-40da-9c69-2353fe311cf7@arm.com>

On 19/09/2024 09:20, Barry Song wrote:
> On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts wrote:
>>
>> On 17/07/2024 08:12, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This series is an RFC that adds sysfs and kernel cmdline controls to configure
>>> the set of allowed large folio sizes that can be used when allocating
>>> file-memory for the page cache. As part of the control mechanism, it provides
>>> for a special-case "preferred folio size for executable mappings" marker.
>>>
>>> I'm trying to solve 2 separate problems with this series:
>>>
>>> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
>>> approach for the change at [1]. Instead of hardcoding the preferred executable
>>> folio size into the arch, user space can now select it. This decouples the arch
>>> code and also makes the mechanism more generic; it can be bypassed (the default)
>>> or any folio size can be set. For my use case, 64K is preferred, but I've also
>>> heard from Willy of a use case where putting all text into 2M PMD-sized folios
>>> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
>>> therefore faulting in all text ahead of time) to achieve that.
>>
>> Just a polite bump on this; I'd really like to get something like this merged to
>> help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
>> weeks back without solid conclusion. I haven't heard any concrete objections
>> yet, but also only a luke-warm reception. How can I move this forwards?
>
> Hi Ryan,
>
> These requirements seem to apply to anon, swap, pagecache, and shmem to
> some extent. While the swapin_enabled knob was rejected, the shmem_enabled
> option is already in place.
>
> I wonder if it's possible to use the existing 'enabled' setting across all
> cases, as from an architectural perspective with cont-pte, pagecache may not
> differ from anon. The demand for reducing page faults, LRU overhead, etc.,
> also seems quite similar.
>
> I imagine that once Android's file systems support mTHP, we’ll uniformly enable
> 64KB for anon, swap, shmem, and page cache. It should then be sufficient to
> enable all of them using a single knob:
> '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.
>
> Is there anything that makes pagecache and shmem significantly different
> from anon? In my Android case, they all seem the same. However, I assume
> there might be other use cases where differentiating them is necessary?

For anon vs shmem, we were just following the precedent set by the legacy PMD
controls, which separated them. I vaguely recall David explaining why there are
separate controls but don't recall the exact reason; I believe there was some
use case where anon THP made sense but shmem THP was problematic. Note too that
the controls expose different options: anon has {always, never, madvise}; shmem
has {always, never, advise (no m; it applies to fadvise too), within_size, force,
deny}. So I guess if the extra shmem options are important then it makes sense
to have a separate control.

For pagecache vs anon, I'm not sure it makes sense to tie these to the same
control. We have readahead information to help us make an educated guess at the
folio size we should use (currently we start at order-2 and increase by 2 orders
every time we hit the readahead marker) and it's much easier to drop pagecache
folios under memory pressure. So by default, I think most/all orders would be
enabled for pagecache. But for anon, things are harder. In the common case, we
likely only want 2M when madvised, and 64K always (and possibly 16K always).

Talking with Willy today, his preference is to not expose any controls for
pagecache at all, and let the architecture hint the preferred folio size for
code, basically how I did it at [1] (linked in the original post). This is very
simple and exposes no user controls so could be easily modified over time as we
get more data.

Trouble is nobody seemed willing to R-b the first approach. So perhaps we're
stuck waiting for Android's FSs to support large folios so we can start
benchmarking the real-world gains?

Thanks,
Ryan

>
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
>>> Android): The theory goes that if all folios are 64K, then failure to allocate a
>>> 64K folio should become unlikely.
>>> But if the page cache is allocating lots of
>>> different orders, with most allocations having an order below 64K (as is the
>>> case today) then ability to allocate 64K folios diminishes. By providing control
>>> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
>>> allocation failure. Additionally I've heard (second hand) of the need to disable
>>> large folios in the page cache entirely due to latency concerns in some
>>> settings. These controls allow all of this without kernel changes.
>>>
>>> The value of (1) is clear and the performance improvements are documented in
>>> patch 2. I don't yet have any data demonstrating the theory for (2) since I
>>> can't reproduce the setup that Barry had at [2]. But my view is that by adding
>>> these controls we will enable the community to explore further, in the same way
>>> that the anon mTHP controls helped harden the understanding for anonymous
>>> memory.
>>>
>>> ---
>>> This series depends on the "mTHP allocation stats for file-backed memory" series
>>> at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
>>> mm selftests have been run; no regressions were observed.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
>>> [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
>>> [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>> Ryan Roberts (4):
>>>   mm: mTHP user controls to configure pagecache large folio sizes
>>>   mm: Introduce "always+exec" for mTHP file_enabled control
>>>   mm: Override mTHP "enabled" defaults at kernel cmdline
>>>   mm: Override mTHP "file_enabled" defaults at kernel cmdline
>>>
>>>  .../admin-guide/kernel-parameters.txt      |  16 ++
>>>  Documentation/admin-guide/mm/transhuge.rst |  66 +++++++-
>>>  include/linux/huge_mm.h                    |  61 ++++---
>>>  mm/filemap.c                               |  26 ++-
>>>  mm/huge_memory.c                           | 158 +++++++++++++++++-
>>>  mm/readahead.c                             |  43 ++++-
>>>  6 files changed, 329 insertions(+), 41 deletions(-)
>>>
>>> --
>>> 2.43.0
>>>
>>
>
> Thanks
> Barry
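
P.S. To make the readahead ramp mentioned in the reply concrete (start at
order-2, go up by 2 orders each time the readahead marker is hit), here is a toy
model. It is a sketch only, not kernel code: the function names, the 4K base
page size and the order-9 (2M) cap are assumptions made for illustration, and
the real readahead path applies further limits that are not modelled here.

```python
def readahead_orders(marker_hits, start_order=2, step=2, max_order=9):
    """Toy model: folio order used initially and after each marker hit.

    start_order and step follow the description in the mail; max_order=9
    (2M with 4K base pages) is an assumed cap, not taken from the kernel.
    """
    orders = [start_order]
    for _ in range(marker_hits):
        # Each readahead marker hit bumps the order by `step`, up to the cap.
        orders.append(min(orders[-1] + step, max_order))
    return orders


def order_to_kib(order, base_page_kib=4):
    """Folio size in KiB for a given order, assuming 4K base pages."""
    return base_page_kib << order


if __name__ == "__main__":
    for order in readahead_orders(4):
        print(f"order-{order}: {order_to_kib(order)}K")
```

Under these assumptions a sequentially read file gets to 64K folios after a
single marker hit and to the assumed 2M cap after four, which is why pagecache
can lean on readahead to pick a size in a way that anon cannot.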