From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C8A09C36017
	for <linux-mm@archiver.kernel.org>; Wed,  2 Apr 2025 13:40:41 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 125F3280004; Wed,  2 Apr 2025 09:40:40 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 0AE37280001; Wed,  2 Apr 2025 09:40:40 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id E69DA280004; Wed,  2 Apr 2025 09:40:39 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id C3952280001
	for <linux-mm@kvack.org>; Wed,  2 Apr 2025 09:40:39 -0400 (EDT)
Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id BCF891619D4
	for <linux-mm@kvack.org>; Wed,  2 Apr 2025 13:40:40 +0000 (UTC)
X-FDA: 83289213840.04.708CA01
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by imf09.hostedemail.com (Postfix) with ESMTP id BD862140013
	for <linux-mm@kvack.org>; Wed,  2 Apr 2025 13:40:38 +0000 (UTC)
Authentication-Results: imf09.hostedemail.com;
	dkim=none;
	spf=pass (imf09.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1743601239;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=AMPTy/5MVoIix/c6WOcXRuIWfWO6spXjl5op+599L5g=;
	b=CO8Ti8WqEVAGb/zK1SeCPqZgqkvjdE0p7jh73Ewlrbqne6upNMHmzOrUSElFNWoz4dx5oT
	63EdyqTwfLLbITQ+3mJgbR84Ex9+lXvVs3dIvDiTN4f+/UW5OkY0++HqZ3/sJsHbpo/HIg
	9mU1bU8vF5Cs9AWqGuH4YjFbujr7EFI=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1743601239; a=rsa-sha256;
	cv=none;
	b=B+W+XihHnnY5ZehZFOK/WidGlO11X+9shur3TzdC6Fl3jwfTjTwkQzLQCfE01p4gLMgGLR
	Csi4dLMOM4RY8iNZSQrHWX39u5iu/eC14NyWi1HZoDW9BMe+0kWUVm2V71oTFlxvkClivI
	64YpIDsmsuuvoI0nVxnkXmXoUrskNBc=
ARC-Authentication-Results: i=1;
	imf09.hostedemail.com;
	dkim=none;
	spf=pass (imf09.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 9FF67FEC;
	Wed,  2 Apr 2025 06:40:40 -0700 (PDT)
Received: from [10.57.88.195] (unknown [10.57.88.195])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E601F3F694;
	Wed,  2 Apr 2025 06:40:35 -0700 (PDT)
Message-ID: <e809c355-7f3c-4c7b-acee-852d5b64d1c7@arm.com>
Date: Wed, 2 Apr 2025 14:40:34 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec
 memory
Content-Language: en-GB
To: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox <willy@infradead.org>,
 Andrew Morton <akpm@linux-foundation.org>,
 David Hildenbrand <david@redhat.com>, Dave Chinner <david@fromorbit.com>,
 Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
References: <20250327160700.1147155-1-ryan.roberts@arm.com>
 <Z-WAbWfZzG1GA-4n@casper.infradead.org>
 <5131c7ad-cc37-44fc-8672-5866ecbef65b@arm.com>
 <Z-b1FmZ5nHzh5huL@casper.infradead.org>
 <ee11907a-5bd7-44ec-844c-8f10ff406b46@arm.com>
 <CAC_TJveU2v+EcokLKJVVZ8Xje2nYmmUg8bvCD8KO1oC5MgmWCA@mail.gmail.com>
 <76f5ba9b-1a8c-4973-89ce-14f504819da1@arm.com>
 <CAC_TJvemG2XDky1HtA8g=SG0nv-sONOS-ssEchdymh-_xjBecw@mail.gmail.com>
From: Ryan Roberts <ryan.roberts@arm.com>
In-Reply-To: <CAC_TJvemG2XDky1HtA8g=SG0nv-sONOS-ssEchdymh-_xjBecw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Rspam-User: 
X-Rspamd-Server: rspam10
X-Rspamd-Queue-Id: BD862140013
X-Stat-Signature: 4eq1s37gkdzr73j54xoudjn7wa51btp4
X-HE-Tag: 1743601238-102452
X-HE-Meta: U2FsdGVkX1+4tv0gwF1ZaSgnQVbjPwhxf9x0fLfQBIFggxkWtOGp0A6DjNviDe2hjY7eKhDKD50mw6UXD8UFDC9s3Ubns7Gc4994hEvJZ5+17UJw4ldywJnG1OdYQJXF8wskaBZ+8qI2Ux97/BWLHk1ITOdoRoZ1EJL8hUlkEL/LpweCU6jm4+Iabjy2W80KlVO7pQwqtuFehOHVof3Xcz/2V3Tz82i2dr0ujhGQjFi8J4nqkpRwROknVilk/lx8iWNpFksFhDwXys0Ly+X2E63PPCFE5j6sti3ZnhmokZkQYUg0OOyuK0FDrUOYlqMNlxVibPI+YguYg8shY7X/XfP+DeP7+OlvqAdiU9qYCtOnjGsO+rPPiCw2DWCsd7qh8OlvwLsbULSa9IBjp/IfGHK87RW4YztUB8DqPQgiF2+2ivNf2gS/M+PcdvzSEQYLM3rXrcduY+hqjCv3yNwClEm20+f4CxZUg2GeZMlqG5vsSs2WoH6MPXzhpCoRZRAMg6p1XyV+K7VVUm9DHZqSCiDdH6e9ZKy2ShRcfaLbex8F4w9poaeo/X/LK8SkszFht/0e3N7+QQXdBUoQIi6iMyeJeliONHBDRgl5Q7P9ETcWtnJLfxmyFkzL4jIoToe31WbLG/XAcbHoOegi8qbNqxrPr6mhsxJIiZxEWv1Q2vHXs0HFY6eYytw+4UXQXCVuXtFE546rFIxCMvEKUKaVQpjcUYfwjgXszGxrtQaL/kJlRG2rLawjwMNAwKxeGpovvh+b++Qz/fXNwZ21Gdel4Bwt5Y0Ie7lO3G54m3TUxPN7sg8z9WL5StTLKNv2bFf9zyCTXYjwgCkXEGyXNT/hAKhz41r2E2RBLyp2BJqh6zOSBUiREdC/DF5cj6mFUtGHZ5TjJGQh8eMefoIBQ2u1j+6Zgf5QUyR36hoelqkl8eOqCRldXCIN6e5Om740djXTyQrPhsqNbcqxgc3hzoO
 JwdjbeFD
 0Gr4cSskpzIbjjsPUsspDdb2oRMJgdgwwX8Uwk1BWcpNgDZQlFtRuitDeqLcquXlnjP/rerzWIfo06PhXQ2A6sQvALEra9+BwazIZV6lAl3s+Hqxd1S0K4GHTgjT64icLD5tfbrpib5BJOf5cJqUQjjrscSHdsYCsQwvP0Uu7LosL0jOPDoQez0/0DcKubi/fv7iWz10cxyfd9pYkuAguXw990vYO608nHfjK81ZJCu7RWHg=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On 01/04/2025 13:55, Kalesh Singh wrote:
> On Tue, Apr 1, 2025 at 3:35 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 01/04/2025 03:19, Kalesh Singh wrote:
>>> On Sat, Mar 29, 2025 at 3:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 28/03/2025 15:14, Matthew Wilcox wrote:
>>>>> On Thu, Mar 27, 2025 at 04:23:14PM -0400, Ryan Roberts wrote:
>>>>>> + Kalesh
>>>>>>
>>>>>> On 27/03/2025 12:44, Matthew Wilcox wrote:
>>>>>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>>>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>>>>>> trade-off is performance improvement (due to more efficient storage of
>>>>>>>> the translations in iTLB) vs potential read amplification (due to
>>>>>>>> reading too much data around the fault which won't be used), and the
>>>>>>>> latter is independent of base page size. I've chosen 64K folio size for
>>>>>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>>>>>> shouldn't lead to any read amplification in practice since the old
>>>>>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>>>>>> anticipate any write amplification because text is always RO.
>>>>>>>
>>>>>>> Is there not also the potential for wasted memory due to ELF alignment?
>>>>>>
>>>>>> I think this is an orthogonal issue? My change isn't making that any worse.
>>>>>
>>>>> To a certain extent, it is.  If readahead was doing order-2 allocations
>>>>> before and is now doing order-4, you're tying up 0-12 extra pages which
>>>>> happen to be filled with zeroes due to being used to cache the contents
>>>>> of a hole.
>>>>
>>>> Well we would still have read them in before, nothing has changed there. But I
>>>> guess your point is more about reclaim? Because those pages are now contained in
>>>> a larger folio, if part of the folio is in use then all of it remains active.
>>>> Whereas before, if the folio was fully contained in the pad area and never
>>>> accessed, it would fall down the LRU quickly and get reclaimed.
>>>>
>>>
>>>
>>> Hi Ryan,
>>>
>>> I agree this was happening before and we don't need to completely
>>> address it here. Though with the patch it's more likely that the holes
>>> will be cached. I'd like to minimize it if possible. Since this is for
>>> EXEC mappings, a simple check we could use is to limit this to the
>>> VM_EXEC vma.
>>>
>>> + if (vm_flags & VM_EXEC) {
>>> + int order = arch_exec_folio_order();
>>> +
>>> + if (order >= 0 && ((end-address)*2) >= 1<<order) { /* Fault around case */
>>
>> I think the intent of this extra check is to ensure the folio will be fully
>> contained within the exec vma? Assuming end is the VA of the end of the vma and
>> address is the VA of the fault, I don't think the maths are quite right? What's
>> the "*2" for? And you probably mean PAGE_SIZE<<order ? But this also doesn't
>> account for alignment; the folio will be aligned down to a natural boundary in
>> the file.
> 
> Hi Ryan,
> 
> Sorry for the pseudocode. You are right it should be PAGE_SIZE. *2 is
> because I missed that this isn't centered around the faulting address
> as the usual fault around logic.
> 
>>
>> But more fundamentally, I thought I suggested reducing the VMA bounds to exclude
>> padding pages the other day at LSF/MM and you said you didn't want to do that
>> because you didn't want to end up with something else mapped in the gap? So
>> doesn't that mean the padding pages are part of the VMA and this check won't help?
> 
> The intention was to check that we aren't reading past (more than we
> do today) into other subsequent segments (and holes) if the exec
> segment is small relative to 64K, in which case we fall back to the
> conventional readahead heuristics.

Gottya. I think this is probably a reasonable thing to do.

> 
>>
>>>
>>> For reference I found below (coincidentally? similar) distributions on
>>> my devices
>>>
>>> == x86 Workstation ==
>>>
>>> Total unique exec segments:   906
>>>
>>> Exec segments >= 16 KB:   663 ( 73.18%)
>>> Exec segments >= 64 KB:   414 ( 45.70%)
>>
>> What are those percentages? They don't add up to more than 100...
> 
> The percent of segments with size >=16KB/64KB of the total number of
> exec mappings. "Exec segments >= 64 KB" is also a subset of "Exec
> segments >= 16 KB" it's why they don't add up to 100.

Ahh got it.

> 
>>
>> The numbers I included with the patch are caclulated based on actual mappings so
>> if we end up with a partially mapped 64K folio (because it runs off the end of
>> the VMA) it wouldn't have been counted as a 64K contiguous mapping. So I don't
>> think this type of change would change my numbers at all.
> 
> I don't think it should affect the results much (probably only a small
> benefit from the extra preread of other segments if we aren't under
> much memory pressure).

I was referring to the my numbers that show the percentage of text mapped by
contpte, not to the performance numbers. But I agree I wouldn't expect either to
change.

> 
> Anyways I don't have a strong opinion, given that this shouldn't be an
> issue once we work out Matthew and Ted's idea for the holes. So we can
> keep it simple for now.

I'll add it in and verify it doesn't affect the numbers. I think there is a
wider decision that needs to be made below first though, so will wait for
opinions on that.

Thanks,
Ryan

> 
> Thanks,
> Kalesh
> 
>>
>>>
>>> == arm64 Android Device ==
>>>
>>> Total unique exec segments:   2171
>>>
>>> Exec segments >= 16 KB:  1602 ( 73.79%)
>>> Exec segments >= 64 KB:   988 ( 45.51%)
>>>
>>> Result were using the below script:
>>>
>>> cat /proc/*/maps | grep 'r-xp' | \
>>> awk '
>>> BEGIN { OFS = "\t" }
>>> $NF ~ /^\// {
>>> path = $NF;
>>> split($1, addr, "-");
>>> size = strtonum("0x" addr[2]) - strtonum("0x" addr[1]);
>>> print size, path;
>>> }' | \
>>> sort -u | \
>>> awk '
>>> BEGIN {
>>> FS = "\t";
>>> total_segments = 0;
>>> segs_ge_16k = 0;
>>> segs_ge_64k = 0;
>>> }
>>> {
>>> total_segments++;
>>> size = $1;
>>> if (size >= 16384) segs_ge_16k++;
>>> if (size >= 65536) segs_ge_64k++;
>>> }
>>> END {
>>> if (total_segments > 0) {
>>> percent_gt_16k = (segs_ge_16k / total_segments) * 100;
>>> percent_gt_64k = (segs_ge_64k / total_segments) * 100;
>>>
>>> printf "Total unique exec segments: %d\n", total_segments;
>>> printf "\n";
>>> printf "Exec segments >= 16 KB: %5d (%6.2f%%)\n", segs_ge_16k, percent_gt_16k;
>>> printf "Exec segments >= 64 KB: %5d (%6.2f%%)\n", segs_ge_64k, percent_gt_64k;
>>> } else {
>>> print "No executable segments found.";
>>> }
>>> }'
>>>
>>>>>
>>>>>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>>>>>> were discussing it in the FS BOF.  Some coordination required (like
>>>>>>> maybe Kalesh could have mentioned it to me rathere than assuming I'd be
>>>>>>> there?)
>>>>>>
>>>>>> I was at Kalesh's talk. David H suggested that a potential solution might be for
>>>>>> readahead to ask the fs where the next hole is and then truncate readahead to
>>>>>> avoid reading the hole. Given it's padding, nothing should directly fault it in
>>>>>> so it never ends up in the page cache. Not sure if you discussed anything like
>>>>>> that if you were talking in parallel?
>>>>>
>>>>> Ted said that he and Kalesh had talked about that solution.  I have a
>>>>> more bold solution in mind which lifts the ext4 extent cache to the
>>>>> VFS inode so that the readahead code can interrogate it.
>>>>>
>>>
>>> Sorry about the hiccup in coordination, Matthew. It was my bad for not
>>> letting you know I planned to discuss it in the MM BoF. I'd like to
>>> hear Ted and your ideas on this when possible.
>>>
>>> Thanks,
>>> Kalesh
>>>
>>>>>> Anyway, I'm not sure if you're suggesting these changes need to be considered as
>>>>>> one somehow or if you're just mentioning it given it is loosely related? My view
>>>>>> is that this change is an improvement indepently and could go in much sooner.
>>>>>
>>>>> This is not a reason to delay this patch.  It's just a downside which
>>>>> should be mentioned in the commit message.
>>>>
>>>> Fair point; I'll add a paragraph about the potential reclaim issue.
>>>>
>>>>>
>>>>>>>> +static inline int arch_exec_folio_order(void)
>>>>>>>> +{
>>>>>>>> +  return -1;
>>>>>>>> +}
>>>>>>>
>>>>>>> This feels a bit fragile.  I often expect to be able to store an order
>>>>>>> in an unsigned int.  Why not return 0 instead?
>>>>>>
>>>>>> Well 0 is a valid order, no? I think we have had the "is order signed or
>>>>>> unsigned" argument before. get_order() returns a signed int :)
>>>>>
>>>>> But why not always return a valid order?  I don't think we need a
>>>>> sentinel.  The default value can be 0 to do what we do today.
>>>>>
>>>>
>>>> But a single order-0 folio is not what we do today. Note that my change as
>>>> currently implemented requests to read a *single* folio of the specified order.
>>>> And note that we only get the order we request to page_cache_ra_order() because
>>>> the size is limited to a single folio. If the size were bigger, that function
>>>> would actually expand the requested order by 2. (although the parameter is
>>>> called "new_order", it's actually interpretted as "old_order").
>>>>
>>>> The current behavior is effectively to read 128K in order-2 folios (with smaller
>>>> folios for boundary alignment).
>>>>
>>>> So I see a few options:
>>
>> Matthew,
>>
>> Did you have any thoughts on these options?
>>
>> Thanks,
>> Ryan
>>
>>>>
>>>>   - Continue to allow non-opted in arches to use the existing behaviour; in this
>>>> case we need a sentinel. This could be -1, UINT_MAX or 0. But in the latter case
>>>> you are preventing an opted-in arch from specifying that they want order-0 -
>>>> it's meaning is overridden.
>>>>
>>>>   - Force all arches to use the new approach with a default folio order (and
>>>> readahead size) of order-0. (The default can be overridden per-arch). Personally
>>>> I'd be nervous about making this change.
>>>>
>>>>   - Decouple the read size from the folio order size; continue to use the 128K
>>>> read size and only allow opting-in to a specific folio order. The default order
>>>> would be 2 (or 0). We would need to fix page_cache_async_ra() to call
>>>> page_cache_ra_order() with "order + 2" (the new order) and fix
>>>> page_cache_ra_order() to treat its order parameter as the *new* order.
>>>>
>>>> Perhaps we should do those fixes anyway (and then actually start with a folio
>>>> order of 0 - which I think you said in the past was your original intention?).
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>