Message-ID: <76f5ba9b-1a8c-4973-89ce-14f504819da1@arm.com>
Date: Tue, 1 Apr 2025 11:35:45 +0100
Subject: Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory
To: Kalesh Singh
Cc: Matthew Wilcox, Andrew Morton, David Hildenbrand, Dave Chinner,
 Catalin Marinas, Will Deacon, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org
References: <20250327160700.1147155-1-ryan.roberts@arm.com>
 <5131c7ad-cc37-44fc-8672-5866ecbef65b@arm.com>
From: Ryan Roberts <ryan.roberts@arm.com>

On 01/04/2025 03:19, Kalesh Singh wrote:
> On Sat, Mar 29, 2025 at 3:08 AM Ryan Roberts wrote:
>>
>> On 28/03/2025 15:14, Matthew Wilcox wrote:
>>> On Thu, Mar 27, 2025 at 04:23:14PM -0400, Ryan Roberts wrote:
>>>> + Kalesh
>>>>
>>>> On 27/03/2025 12:44, Matthew Wilcox wrote:
>>>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>>>> trade-off is performance improvement (due to more efficient storage of
>>>>>> the translations in iTLB) vs potential read amplification (due to
>>>>>> reading too much data around the fault which won't be used), and the
>>>>>> latter is independent of base page size. I've chosen 64K folio size for
>>>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>>>> shouldn't lead to any read amplification in practice since the old
>>>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>>>> anticipate any write amplification because text is always RO.
>>>>>
>>>>> Is there not also the potential for wasted memory due to ELF alignment?
>>>>
>>>> I think this is an orthogonal issue? My change isn't making that any worse.
>>>
>>> To a certain extent, it is. If readahead was doing order-2 allocations
>>> before and is now doing order-4, you're tying up 0-12 extra pages which
>>> happen to be filled with zeroes due to being used to cache the contents
>>> of a hole.
>>
>> Well we would still have read them in before, nothing has changed there. But I
>> guess your point is more about reclaim? Because those pages are now contained in
>> a larger folio, if part of the folio is in use then all of it remains active.
>> Whereas before, if the folio was fully contained in the pad area and never
>> accessed, it would fall down the LRU quickly and get reclaimed.
>>
>
> Hi Ryan,
>
> I agree this was happening before and we don't need to completely
> address it here. Though with the patch it's more likely that the holes
> will be cached. I'd like to minimize it if possible. Since this is for
> EXEC mappings, a simple check we could use is to limit this to the
> VM_EXEC vma.
>
> +	if (vm_flags & VM_EXEC) {
> +		int order = arch_exec_folio_order();
> +
> +		if (order >= 0 && ((end - address) * 2) >= (1 << (order + PAGE_SHIFT))) {
>
> For reference I found below (coincidentally? similar) distributions on
> my devices
>
> == x86 Workstation ==
>
> Total unique exec segments: 906
>
> Exec segments >= 16 KB:   663 ( 73.18%)
> Exec segments >= 64 KB:   414 ( 45.70%)

What are those percentages? They add up to more than 100... The numbers I
included with the patch are calculated based on actual mappings, so if we end
up with a partially mapped 64K folio (because it runs off the end of the VMA)
it wouldn't have been counted as a 64K contiguous mapping. So I don't think
this type of change would change my numbers at all.

>
> == arm64 Android Device ==
>
> Total unique exec segments: 2171
>
> Exec segments >= 16 KB:  1602 ( 73.79%)
> Exec segments >= 64 KB:   988 ( 45.51%)
>
> Results were generated using the below script:
>
> cat /proc/*/maps | grep 'r-xp' | \
> awk '
>   BEGIN { OFS = "\t" }
>   $NF ~ /^\// {
>     path = $NF;
>     split($1, addr, "-");
>     size = strtonum("0x" addr[2]) - strtonum("0x" addr[1]);
>     print size, path;
>   }' | \
> sort -u | \
> awk '
>   BEGIN {
>     FS = "\t";
>     total_segments = 0;
>     segs_ge_16k = 0;
>     segs_ge_64k = 0;
>   }
>   {
>     total_segments++;
>     size = $1;
>     if (size >= 16384) segs_ge_16k++;
>     if (size >= 65536) segs_ge_64k++;
>   }
>   END {
>     if (total_segments > 0) {
>       percent_gt_16k = (segs_ge_16k / total_segments) * 100;
>       percent_gt_64k = (segs_ge_64k / total_segments) * 100;
>
>       printf "Total unique exec segments: %d\n", total_segments;
>       printf "\n";
>       printf "Exec segments >= 16 KB: %5d (%6.2f%%)\n", segs_ge_16k, percent_gt_16k;
>       printf "Exec segments >= 64 KB: %5d (%6.2f%%)\n", segs_ge_64k, percent_gt_64k;
>     } else {
>       print "No executable segments found.";
>     }
>   }'
>
>>>
>>>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>>>> were discussing it in the FS BOF. Some coordination required (like
>>>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>>>>> there?)
>>>>
>>>> I was at Kalesh's talk. David H suggested that a potential solution might be for
>>>> readahead to ask the fs where the next hole is and then truncate readahead to
>>>> avoid reading the hole. Given it's padding, nothing should directly fault it in
>>>> so it never ends up in the page cache. Not sure if you discussed anything like
>>>> that if you were talking in parallel?
>>>
>>> Ted said that he and Kalesh had talked about that solution. I have a
>>> more bold solution in mind which lifts the ext4 extent cache to the
>>> VFS inode so that the readahead code can interrogate it.
>>>
>
> Sorry about the hiccup in coordination, Matthew. It was my bad for not
> letting you know I planned to discuss it in the MM BoF. I'd like to
> hear your and Ted's ideas on this when possible.
>
> Thanks,
> Kalesh
>
>>>> Anyway, I'm not sure if you're suggesting these changes need to be considered as
>>>> one somehow or if you're just mentioning it given it is loosely related? My view
>>>> is that this change is an improvement independently and could go in much sooner.
>>>
>>> This is not a reason to delay this patch. It's just a downside which
>>> should be mentioned in the commit message.
>>
>> Fair point; I'll add a paragraph about the potential reclaim issue.
>>
>>>
>>>>>> +static inline int arch_exec_folio_order(void)
>>>>>> +{
>>>>>> +	return -1;
>>>>>> +}
>>>>>
>>>>> This feels a bit fragile. I often expect to be able to store an order
>>>>> in an unsigned int. Why not return 0 instead?
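
(For concreteness in the sentinel discussion below: the generic default is
the hunk quoted above, and a per-arch opt-in might look like the sketch that
follows. The arm64 body here is only an illustration of the 64K choice
discussed earlier in the thread, not the actual patch code.)

	/* Generic default (the hunk above): -1 is a sentinel meaning
	 * "no opt-in; keep the existing read-around behaviour". */
	#ifndef arch_exec_folio_order
	static inline int arch_exec_folio_order(void)
	{
		return -1;
	}
	#endif

	/* Hypothetical arm64 override (needs <linux/log2.h> and
	 * <linux/sizes.h>), illustrative only: request 64K folios for
	 * exec mappings. ilog2(SZ_64K >> PAGE_SHIFT) evaluates to
	 * order 4 with 4K base pages and order 2 with 16K base pages. */
	#define arch_exec_folio_order arch_exec_folio_order
	static inline int arch_exec_folio_order(void)
	{
		return ilog2(SZ_64K >> PAGE_SHIFT);
	}

This is also why 0 can't simply double as the sentinel: an arch that
genuinely wants order-0 folios would have no way to say so.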
>>>> Well 0 is a valid order, no? I think we have had the "is order signed or
>>>> unsigned" argument before. get_order() returns a signed int :)
>>>
>>> But why not always return a valid order? I don't think we need a
>>> sentinel. The default value can be 0 to do what we do today.
>>>
>>
>> But a single order-0 folio is not what we do today. Note that my change as
>> currently implemented requests to read a *single* folio of the specified order.
>> And note that we only get the order we request from page_cache_ra_order() because
>> the size is limited to a single folio. If the size were bigger, that function
>> would actually expand the requested order by 2. (Although the parameter is
>> called "new_order", it's actually interpreted as "old_order".)
>>
>> The current behaviour is effectively to read 128K in order-2 folios (with smaller
>> folios for boundary alignment).
>>
>> So I see a few options:

Matthew, did you have any thoughts on these options?

Thanks,
Ryan

>>
>> - Continue to allow non-opted-in arches to use the existing behaviour; in this
>> case we need a sentinel. This could be -1, UINT_MAX or 0. But in the latter case
>> you are preventing an opted-in arch from specifying that they want order-0 -
>> its meaning is overridden.
>>
>> - Force all arches to use the new approach with a default folio order (and
>> readahead size) of order-0. (The default can be overridden per-arch.) Personally
>> I'd be nervous about making this change.
>>
>> - Decouple the read size from the folio order size; continue to use the 128K
>> read size and only allow opting in to a specific folio order. The default order
>> would be 2 (or 0). We would need to fix page_cache_async_ra() to call
>> page_cache_ra_order() with "order + 2" (the new order) and fix
>> page_cache_ra_order() to treat its order parameter as the *new* order.
>>
>> Perhaps we should do those fixes anyway (and then actually start with a folio
>> order of 0 - which I think you said in the past was your original intention?).
>>
>> Thanks,
>> Ryan
>>
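
P.S. For anyone skimming the thread, the arithmetic behind picking a single
64K size that suits both base page configs is just
folio_size = base_page_size << order. A quick illustrative userspace check
(not kernel code, written only to demonstrate the mapping):

	#include <stdio.h>

	/* folio_size = page_size << order, so one 64K target corresponds
	 * to a different folio order depending on the base page size. */
	int main(void)
	{
		const unsigned long target = 64 * 1024;        /* 64K folio */
		const unsigned long bases[] = { 4096, 16384 }; /* 4K, 16K */

		for (int i = 0; i < 2; i++) {
			int order = 0;

			while ((bases[i] << order) < target)
				order++;

			/* Prints: 4K -> order 4; 16K -> order 2 */
			printf("%luK base pages -> order %d\n",
			       bases[i] >> 10, order);
		}
		return 0;
	}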