Date: Thu, 20 Mar 2025 11:53:32 +1100
From: Dave Chinner
To: Barry Song <21cnbao@gmail.com>
Cc: Yang Shi, Ryan Roberts, lsf-pc@lists.linux-foundation.org, Linux-MM, Matthew Wilcox
Subject: Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
References: <6201267f-6d3a-4942-9a61-371bd41d633d@arm.com>

On Thu, Mar 20, 2025 at 11:13:11AM +1300, Barry Song wrote:
> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner wrote:
> > On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
> > > On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts wrote:
> > However, I agree with the general principle that the fs should be
> > directing the inode mapping tree folio order behaviour. i.e. the
> > filesystem already sets both the floor and the desired behaviour for
> > folio instantiation for any given inode mapping tree.
> >
> > It also needs to be able to instantiate large folios -before- the
> > executable is mapped into VMAs via mmap() because files can be read
> > into cache before they are run (e.g. boot time readahead hacks).
> > i.e. a mmap() time directive is too late to apply to the inode
> > mapping tree to guarantee optimal layout for PTE optimisation. It
> > also may not be possible to apply mmap() time directives due to
> > other filesystem constraints, so mmap() time directives may well end
> > up being unpredictable and unreliable....
> >
>
> ELF loading and the linker may lead to readaheading a small portion
> of the code text before mmap().
> However, once the executable files
> are large, the minor loss of large folios due to limited read-ahead of
> the text may not be substantial enough to justify consideration.
>
> But "boot time readahead hacks" seem like something that can read
> ahead significantly. Unless we can modify these "boot time readahead
> hacks" to use mmap() with EXEC mapping, it seems we would need
> something at the sys_read() to apply the preferred size.

Yes, that's exactly what I said. :)

But you haven't understood the example I gave (i.e. boot time
readahead). There are many ways to have executables cached without
them being mapped executable. They get accessed by a linker during
compilation of code. They get updated by the OS package manager. A
backup or deduplication program accesses them. A virus scanner reads
them looking for trojans, etc.

i.e. there are lots of ways of getting executables cached that
prevent optimal large folio formation if the filesystem doesn't
directly control formation of said large folios.

Hence if we don't apply large folio selection criteria to -all-
buffered IO (read, write and mmap), the result when mmap(EXEC)
occurs is going to be .... unpredictable and not always optimal.

So assuming that the cache is cold, we want filemap_fault() to
allocate large folios from cache misses on read faults, yes?

That lands us in do_sync_mmap_readahead(), and that has a bit of a
problem w.r.t. large folios. It ends up calling:

	page_cache_ra_order(.... new_order = 0)

This limits folios allocated by readahead to order-2 in size, unless
the mapping was instantiated by the filesystem with a larger
min_order, in which case it will use the larger min_order value.

Either way, we don't get the desired large folio size the arch wants
to optimise the page table mappings.

I'd suggest this would be fixed by something like this in
do_sync_mmap_readahead():

-	page_cache_ra_order(..., 0);
+	new_order = 0;
+	if (is_exec_mapping(vmf->vma->vm_flags))
+		new_order = 
+	page_cache_ra_order(..., new_order);

And now the page cache will be populated with large folios of at
least the order requested if the filesystem can support folios of
that size.

Unless I've misunderstood something (cold cache instantiation of
64kB folios is what you desired, isn't it?), that small change
should largely make exec mappings behave the way you want...

> > There's also an obvious filesystem level trigger for enabling this
> > behaviour in a generic manner. e.g. The filesystem can look at the
> > X perm bits on the inode at instantiation time and if they are set,
> > set a "desired order" value+flag on the mapping at inode cache
> > instantiation in addition to "min order".
> >
>
> Not sure what proportion of an executable file is the text section. If it's
> less than 30% or 50%, it seems we might be allocating "preferred size"
> large folios to many other sections that may not benefit from them?
>
> Also, a Bash shell script with executable permissions might get a
> preferred large folio size. This seems weird?

But none of this is actually a problem at all. Fewer, larger folios
still means less page cache and memory reclaim management overhead
even if there is no direct benefit from optimised page table
mapping.

Also, we typically know the file size at mapping tree instantiation
time and hence we could make a sane decision as to whether large
folios should be used for any specific executable file.

> By the way, are .so files executable files, even though they may contain
> a lot of code? As I check my filesystems, it seems not:
>
> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
> -rw-r--r-- 1 root root 133280 Jan 11 2023 libz.so.1.2.13

True, I hadn't considered that.

Seems like fixing do_sync_mmap_readahead() might be the best way to
go then....

> > If a desired order is configured, the page cache read code can then
> > pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
> > value to folio allocation. If that can't be allocated then it can
> > fall back to single page folios instead of failing.
> >
> > At this point, we will always optimistically try to allocate larger
> > folios for executables on all architectures. Architectures that
> > can optimise sequential PTE mappings can then simply add generic
> > support for large folio optimisation, and more efficient executable
> > mappings simply fall out of the generic support for efficient
> > mapping of large folios and filesystems preferring large folios for
> > executable inode mappings....
>
> I feel this falls more within the scope of architecture and memory
> management rather than the filesystem. If possible, we should try
> to avoid modifying the filesystem code?

Large folios may be a MM construct, but you can't use them in the
page cache without the backing filesystem being fully aware of them,
and the mm subsystem has to work within the constraints the
filesystem places on large folios in the page cache.

If we need to change constraints or enact new policies around file
IO specific large folio optimisations, then we definitely are going
to need to modify both mm and filesystem code to implement them....

-Dave.
-- 
Dave Chinner
david@fromorbit.com