From: Kalesh Singh <kaleshsingh@google.com>
Date: Tue, 1 Apr 2025 10:55:38 -0700
Subject: Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory
To: Ryan Roberts
Cc: Matthew Wilcox, Andrew Morton, David Hildenbrand, Dave Chinner,
    Catalin Marinas, Will Deacon, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org
In-Reply-To: <76f5ba9b-1a8c-4973-89ce-14f504819da1@arm.com>
References: <20250327160700.1147155-1-ryan.roberts@arm.com>
 <5131c7ad-cc37-44fc-8672-5866ecbef65b@arm.com>
 <76f5ba9b-1a8c-4973-89ce-14f504819da1@arm.com>

On Tue, Apr 1, 2025 at 3:35 AM Ryan Roberts wrote:
>
> On 01/04/2025 03:19, Kalesh Singh wrote:
> > On Sat, Mar 29, 2025 at 3:08 AM Ryan Roberts wrote:
> >>
> >> On 28/03/2025 15:14, Matthew Wilcox wrote:
> >>> On Thu, Mar 27, 2025 at 04:23:14PM -0400, Ryan Roberts wrote:
> >>>> + Kalesh
> >>>>
> >>>> On 27/03/2025 12:44, Matthew Wilcox wrote:
> >>>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
> >>>>>> So let's special-case the read(ahead) logic for executable mappings. The
> >>>>>> trade-off is performance improvement (due to more efficient storage of
> >>>>>> the translations in iTLB) vs potential read amplification (due to
> >>>>>> reading too much data around the fault which won't be used), and the
> >>>>>> latter is independent of base page size. I've chosen 64K folio size for
> >>>>>> arm64 which benefits both the 4K and 16K base page size configs and
> >>>>>> shouldn't lead to any read amplification in practice since the old
> >>>>>> read-around path was (usually) reading blocks of 128K. I don't
> >>>>>> anticipate any write amplification because text is always RO.
> >>>>>
> >>>>> Is there not also the potential for wasted memory due to ELF alignment?
> >>>>
> >>>> I think this is an orthogonal issue? My change isn't making that any worse.
> >>>
> >>> To a certain extent, it is. If readahead was doing order-2 allocations
> >>> before and is now doing order-4, you're tying up 0-12 extra pages which
> >>> happen to be filled with zeroes due to being used to cache the contents
> >>> of a hole.
> >>
> >> Well we would still have read them in before, nothing has changed there. But I
> >> guess your point is more about reclaim? Because those pages are now contained in
> >> a larger folio, if part of the folio is in use then all of it remains active.
> >> Whereas before, if the folio was fully contained in the pad area and never
> >> accessed, it would fall down the LRU quickly and get reclaimed.
> >>
> >
> > Hi Ryan,
> >
> > I agree this was happening before and we don't need to completely
> > address it here. Though with the patch it's more likely that the holes
> > will be cached. I'd like to minimize it if possible. Since this is for
> > EXEC mappings, a simple check we could use is to limit this to the
> > VM_EXEC vma.
> >
> > +	if (vm_flags & VM_EXEC) {
> > +		int order = arch_exec_folio_order();
> > +
> > +		if (order >= 0 && ((end - address) * 2) >= (1 << order)) {
>
> I think the intent of this extra check is to ensure the folio will be fully
> contained within the exec vma? Assuming end is the VA of the end of the vma and
> address is the VA of the fault, I don't think the maths are quite right? What's
> the "*2" for? And you probably mean PAGE_SIZE << order? You'd also need to
> account for alignment; the folio will be aligned down to a natural boundary in
> the file.

Hi Ryan,

Sorry for the pseudocode. You are right, it should be PAGE_SIZE. The *2 is
because I missed that this isn't centered around the faulting address the
way the usual fault-around logic is.
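
To make that concrete, here is a corrected, self-contained sketch of the
containment check I had in mind. It is only an illustration, not the actual
patch: exec_folio_fits() is a made-up name, vm_start is added for
completeness, and it assumes the folio start is aligned down to a natural
boundary the way readahead would align it in the file.

/*
 * Illustrative userspace model of the containment check -- not kernel
 * code. 'addr' is the faulting virtual address, 'vm_start'/'vm_end'
 * bound the exec VMA, and 'order' is the arch-requested folio order.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static bool exec_folio_fits(uint64_t vm_start, uint64_t vm_end,
                            uint64_t addr, int order)
{
        uint64_t folio_bytes, folio_start;

        if (order < 0)          /* arch did not opt in */
                return false;

        folio_bytes = PAGE_SIZE << order;
        /* Readahead aligns the folio down to a natural boundary. */
        folio_start = addr & ~(folio_bytes - 1);

        /* Use the large folio only if it lies entirely inside the VMA. */
        return folio_start >= vm_start && folio_start + folio_bytes <= vm_end;
}

int main(void)
{
        uint64_t start = 0x400000, end = start + 100 * 1024; /* 100K exec VMA */

        /* Fault in the first 64K: the aligned order-4 folio fits (prints 1). */
        printf("%d\n", exec_folio_fits(start, end, 0x404000, 4));
        /* Fault near the end: the aligned folio would overrun (prints 0). */
        printf("%d\n", exec_folio_fits(start, end, 0x418000, 4));
        return 0;
}

With a check along those lines the 64K path would only be taken when the
whole folio fits inside the exec VMA; otherwise we fall back to the
existing readahead behaviour.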
>
> But more fundamentally, I thought I suggested reducing the VMA bounds to exclude
> padding pages the other day at LSF/MM and you said you didn't want to do that
> because you didn't want to end up with something else mapped in the gap? So
> doesn't that mean the padding pages are part of the VMA and this check won't help?

The intention was to check that we aren't reading past (more than we do
today) into other subsequent segments (and holes) if the exec segment is
small relative to 64K, in which case we fall back to the conventional
readahead heuristics.

>
> >
> > For reference I found below (coincidentally? similar) distributions on
> > my devices
> >
> > == x86 Workstation ==
> >
> > Total unique exec segments: 906
> >
> > Exec segments >= 16 KB: 663 ( 73.18%)
> > Exec segments >= 64 KB: 414 ( 45.70%)
>
> What are those percentages? They don't add up to more than 100...

The percentage of segments with size >= 16KB/64KB out of the total number
of exec mappings. "Exec segments >= 64 KB" is also a subset of "Exec
segments >= 16 KB", which is why they don't add up to 100.

>
> The numbers I included with the patch are calculated based on actual mappings, so
> if we end up with a partially mapped 64K folio (because it runs off the end of
> the VMA) it wouldn't have been counted as a 64K contiguous mapping. So I don't
> think this type of change would change my numbers at all.

I don't think it should affect the results much (probably only a small
benefit from the extra preread of other segments if we aren't under much
memory pressure). Anyways I don't have a strong opinion, given that this
shouldn't be an issue once we work out Matthew and Ted's idea for the
holes. So we can keep it simple for now.

Thanks,
Kalesh

>
> >
> > == arm64 Android Device ==
> >
> > Total unique exec segments: 2171
> >
> > Exec segments >= 16 KB: 1602 ( 73.79%)
> > Exec segments >= 64 KB: 988 ( 45.51%)
> >
> > Results were obtained using the below script:
> >
> > cat /proc/*/maps | grep 'r-xp' | \
> > awk '
> > BEGIN { OFS = "\t" }
> > $NF ~ /^\// {
> >     path = $NF;
> >     split($1, addr, "-");
> >     size = strtonum("0x" addr[2]) - strtonum("0x" addr[1]);
> >     print size, path;
> > }' | \
> > sort -u | \
> > awk '
> > BEGIN {
> >     FS = "\t";
> >     total_segments = 0;
> >     segs_ge_16k = 0;
> >     segs_ge_64k = 0;
> > }
> > {
> >     total_segments++;
> >     size = $1;
> >     if (size >= 16384) segs_ge_16k++;
> >     if (size >= 65536) segs_ge_64k++;
> > }
> > END {
> >     if (total_segments > 0) {
> >         percent_gt_16k = (segs_ge_16k / total_segments) * 100;
> >         percent_gt_64k = (segs_ge_64k / total_segments) * 100;
> >
> >         printf "Total unique exec segments: %d\n", total_segments;
> >         printf "\n";
> >         printf "Exec segments >= 16 KB: %5d (%6.2f%%)\n", segs_ge_16k, percent_gt_16k;
> >         printf "Exec segments >= 64 KB: %5d (%6.2f%%)\n", segs_ge_64k, percent_gt_64k;
> >     } else {
> >         print "No executable segments found.";
> >     }
> > }'
> >
> >>>
> >>>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
> >>>>> were discussing it in the FS BOF. Some coordination required (like
> >>>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
> >>>>> there?)
> >>>>
> >>>> I was at Kalesh's talk. David H suggested that a potential solution might be for
> >>>> readahead to ask the fs where the next hole is and then truncate readahead to
> >>>> avoid reading the hole. Given it's padding, nothing should directly fault it in
> >>>> so it never ends up in the page cache. Not sure if you discussed anything like
> >>>> that if you were talking in parallel?
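
To sketch the truncation idea in userspace terms: lseek(SEEK_HOLE) can
stand in for whatever interface the in-kernel readahead path would use to
ask the fs where the next hole starts. This is only a model of the
clamping, not a proposal for the kernel interface, and
clamp_readahead_at_hole() is a made-up name.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Clamp a readahead window [pos, pos + len) so it stops at the next
 * hole in the file.
 */
static size_t clamp_readahead_at_hole(int fd, off_t pos, size_t len)
{
        off_t hole = lseek(fd, pos, SEEK_HOLE);

        if (hole == (off_t)-1)          /* no hole info: leave unchanged */
                return len;
        if (hole == pos)                /* window starts inside a hole */
                return 0;
        if (hole < pos + (off_t)len)    /* stop the window at the hole */
                len = hole - pos;
        return len;
}

int main(int argc, char **argv)
{
        int fd = open(argc > 1 ? argv[1] : "/bin/ls", O_RDONLY);

        if (fd < 0)
                return 1;
        printf("clamped readahead length: %zu\n",
               clamp_readahead_at_hole(fd, 0, 128 * 1024));
        close(fd);
        return 0;
}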
> >>>
> >>> Ted said that he and Kalesh had talked about that solution. I have a
> >>> bolder solution in mind which lifts the ext4 extent cache to the
> >>> VFS inode so that the readahead code can interrogate it.
> >>>
> >
> > Sorry about the hiccup in coordination, Matthew. It was my bad for not
> > letting you know I planned to discuss it in the MM BoF. I'd like to
> > hear Ted's and your ideas on this when possible.
> >
> > Thanks,
> > Kalesh
> >
> >>>> Anyway, I'm not sure if you're suggesting these changes need to be considered as
> >>>> one somehow or if you're just mentioning it given it is loosely related? My view
> >>>> is that this change is an improvement independently and could go in much sooner.
> >>>
> >>> This is not a reason to delay this patch. It's just a downside which
> >>> should be mentioned in the commit message.
> >>
> >> Fair point; I'll add a paragraph about the potential reclaim issue.
> >>
> >>>
> >>>>>> +static inline int arch_exec_folio_order(void)
> >>>>>> +{
> >>>>>> +	return -1;
> >>>>>> +}
> >>>>>
> >>>>> This feels a bit fragile. I often expect to be able to store an order
> >>>>> in an unsigned int. Why not return 0 instead?
> >>>>
> >>>> Well 0 is a valid order, no? I think we have had the "is order signed or
> >>>> unsigned" argument before. get_order() returns a signed int :)
> >>>
> >>> But why not always return a valid order? I don't think we need a
> >>> sentinel. The default value can be 0 to do what we do today.
> >>>
> >>
> >> But a single order-0 folio is not what we do today. Note that my change as
> >> currently implemented requests to read a *single* folio of the specified order.
> >> And note that we only get the order we request to page_cache_ra_order() because
> >> the size is limited to a single folio. If the size were bigger, that function
> >> would actually expand the requested order by 2. (Although the parameter is
> >> called "new_order", it's actually interpreted as "old_order".)
> >>
> >> The current behavior is effectively to read 128K in order-2 folios (with smaller
> >> folios for boundary alignment).
> >>
> >> So I see a few options:
>
> Matthew,
>
> Did you have any thoughts on these options?
>
> Thanks,
> Ryan
>
> >>
> >>   - Continue to allow non-opted-in arches to use the existing behaviour; in this
> >> case we need a sentinel. This could be -1, UINT_MAX or 0. But in the latter case
> >> you are preventing an opted-in arch from specifying that it wants order-0;
> >> its meaning is overridden.
> >>
> >>   - Force all arches to use the new approach with a default folio order (and
> >> readahead size) of order-0. (The default can be overridden per-arch.) Personally
> >> I'd be nervous about making this change.
> >>
> >>   - Decouple the read size from the folio order; continue to use the 128K
> >> read size and only allow opting in to a specific folio order. The default order
> >> would be 2 (or 0). We would need to fix page_cache_async_ra() to call
> >> page_cache_ra_order() with "order + 2" (the new order) and fix
> >> page_cache_ra_order() to treat its order parameter as the *new* order.
> >>
> >> Perhaps we should do those fixes anyway (and then actually start with a folio
> >> order of 0 - which I think you said in the past was your original intention?).
> >>
> >> Thanks,
> >> Ryan
> >>
>
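
For what it's worth, the sentinel-based opt-in from the first option could
look roughly like the standalone sketch below. This is only an
illustration, not the actual patch: the arm64 value of 4 assumes 4K base
pages and a 64K exec folio, and arm64_exec_folio_order() and
fault_around_exec() are made-up names for the example.

#include <stdio.h>

/* Generic default: no preferred exec folio order, keep current behaviour. */
static inline int arch_exec_folio_order(void)
{
        return -1;
}

/* What an opted-in arm64-style override might return: order 4 == 64K
 * with 4K base pages. (Illustrative value only.) */
static inline int arm64_exec_folio_order(void)
{
        return 4;
}

static void fault_around_exec(int order)
{
        /* 0 is itself a valid order (one base page), hence the -1 sentinel. */
        if (order < 0)
                printf("no opt-in: keep default 128K read-around in order-2 folios\n");
        else
                printf("opted in: read one order-%d folio (%d KB with 4K pages)\n",
                       order, 4 << order);
}

int main(void)
{
        fault_around_exec(arch_exec_folio_order());
        fault_around_exec(arm64_exec_folio_order());
        return 0;
}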