Subject: Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory
From: Ryan Roberts <ryan.roberts@arm.com>
To: Matthew Wilcox
Cc: Kalesh Singh, Andrew Morton, David Hildenbrand, Dave Chinner, Catalin Marinas, Will Deacon, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Date: Sat, 29 Mar 2025 10:07:59 +0000
References: <20250327160700.1147155-1-ryan.roberts@arm.com> <5131c7ad-cc37-44fc-8672-5866ecbef65b@arm.com>
On 28/03/2025 15:14, Matthew Wilcox wrote:
> On Thu, Mar 27, 2025 at 04:23:14PM -0400, Ryan Roberts wrote:
>> + Kalesh
>>
>> On 27/03/2025 12:44, Matthew Wilcox wrote:
>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>> trade-off is performance improvement (due to more efficient storage of
>>>> the translations in iTLB) vs potential read amplification (due to
>>>> reading too much data around the fault which won't be used), and the
>>>> latter is independent of base page size.
>>>> I've chosen 64K folio size for
>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>> shouldn't lead to any read amplification in practice since the old
>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>> anticipate any write amplification because text is always RO.
>>>
>>> Is there not also the potential for wasted memory due to ELF alignment?
>>
>> I think this is an orthogonal issue? My change isn't making that any worse.
>
> To a certain extent, it is. If readahead was doing order-2 allocations
> before and is now doing order-4, you're tying up 0-12 extra pages which
> happen to be filled with zeroes due to being used to cache the contents
> of a hole.

Well, we would still have read them in before, so nothing has changed there.
But I guess your point is more about reclaim? Because those pages are now
contained in a larger folio, if part of the folio is in use then all of it
remains active. Whereas before, if a folio was fully contained in the pad
area and never accessed, it would fall down the LRU quickly and get reclaimed.

>
>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>> were discussing it in the FS BOF. Some coordination required (like
>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>>> there?)
>>
>> I was at Kalesh's talk. David H suggested that a potential solution might
>> be for readahead to ask the fs where the next hole is and then truncate
>> readahead to avoid reading the hole. Given it's padding, nothing should
>> directly fault it in, so it never ends up in the page cache. Not sure if
>> you discussed anything like that if you were talking in parallel?
>
> Ted said that he and Kalesh had talked about that solution. I have a
> bolder solution in mind which lifts the ext4 extent cache to the
> VFS inode so that the readahead code can interrogate it.
>
>> Anyway, I'm not sure if you're suggesting these changes need to be
>> considered as one somehow or if you're just mentioning it given it is
>> loosely related? My view is that this change is an improvement
>> independently and could go in much sooner.
>
> This is not a reason to delay this patch. It's just a downside which
> should be mentioned in the commit message.

Fair point; I'll add a paragraph about the potential reclaim issue.

>
>>>> +static inline int arch_exec_folio_order(void)
>>>> +{
>>>> +	return -1;
>>>> +}
>>>
>>> This feels a bit fragile. I often expect to be able to store an order
>>> in an unsigned int. Why not return 0 instead?
>>
>> Well 0 is a valid order, no? I think we have had the "is order signed or
>> unsigned" argument before. get_order() returns a signed int :)
>
> But why not always return a valid order? I don't think we need a
> sentinel. The default value can be 0 to do what we do today.

But a single order-0 folio is not what we do today. Note that my change as
currently implemented requests to read a *single* folio of the specified
order. And note that we only get the order we request from
page_cache_ra_order() because the size is limited to a single folio. If the
size were bigger, that function would actually expand the requested order by 2
(although the parameter is called "new_order", it's actually interpreted as
"old_order"). The current behaviour is effectively to read 128K in order-2
folios (with smaller folios for boundary alignment).

So I see a few options:

- Continue to allow non-opted-in arches to use the existing behaviour; in
  this case we need a sentinel. This could be -1, UINT_MAX or 0, but in the
  latter case you are preventing an opted-in arch from specifying that they
  want order-0 - its meaning is overridden.

- Force all arches to use the new approach with a default folio order (and
  readahead size) of order-0. (The default can be overridden per-arch.)
  Personally I'd be nervous about making this change.
- Decouple the read size from the folio order; continue to use the 128K read
  size and only allow opting in to a specific folio order. The default order
  would be 2 (or 0). We would need to fix page_cache_async_ra() to call
  page_cache_ra_order() with "order + 2" (the new order) and fix
  page_cache_ra_order() to treat its order parameter as the *new* order.
  Perhaps we should do those fixes anyway (and then actually start with a
  folio order of 0 - which I think you said in the past was your original
  intention?).

Thanks,
Ryan