From: Barry Song <21cnbao@gmail.com>
Date: Thu, 20 Mar 2025 11:13:11 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
To: Dave Chinner
Cc: Yang Shi, Ryan Roberts, lsf-pc@lists.linux-foundation.org, Linux-MM, Matthew Wilcox
References: <6201267f-6d3a-4942-9a61-371bd41d633d@arm.com>

On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner wrote:
>
> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
> > On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts wrote:
> > >
> > > Hi All,
> > >
> > > I know this is very last minute, but I was hoping that it might be possible to
> > > squeeze in a session to discuss the following?
>
> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
> discussed on the dev lists...
>
> > > Summary/Background:
> > >
> > > On arm64, physically contiguous and naturally aligned regions can take advantage
> > > of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> > > regions containing text, current readahead behaviour often yields small,
> > > misaligned folios, preventing this optimization. This proposal introduces a
> > > special-case path for executable mappings, performing synchronous reads of an
> > > architecture-chosen size into large folios (64 KB on arm64). Early performance
> > > tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> > > gains.
> >
> > AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
> > adding to the tests.
> >
> > >
> > > I’ve previously posted attempts to enable this performance improvement ([1],
> > > [2]), but there were objections and conversation fizzled out. Now that I have
> > > more compelling performance data, I’m hoping there is now stronger
> > > justification, and we can find a path forwards.
> > >
> > > What I’d Like to Cover:
> > >
> > > - Describe how text memory should ideally be mapped and why it benefits
> > > performance.
>
> I think the main people involved already understand this...
>
> > > - Brief review of performance data.
>
> You don't need to convince me - there's 3 decades of evidence
> proving that larger, fewer page table mappings for executables
> results in better performance.
>
> > > - Discuss options for the best way to encourage text into large folios:
> > > - Let the architecture request a preferred size
> > > - Extend VMA attributes to include preferred THP size hint
> > > - Provide a sysfs knob
> > > - Plug into the “mapping min folio order” infrastructure
> > > - Other approaches?
>
> Implement generic large folio/sequential PTE mapping optimisations
> for each platform, then control it by letting the filesystem decide
> what the desired mapping order and alignment should be for any given
> inode mapping tree.
>
> > Did you try LBS? You can have 64K block size with LBS, it should
> > create large folios for page cache so text should get large folios
> > automatically (IIRC arm64 linker script has 64K alignment by default).
>
> We really don't want people using 64kB block size filesystems for
> root filesystems - there are plenty of downsides to using huge block
> sizes for filesystems that generally hold many tiny files.

Agreed. Large folios remain compatible with existing filesystems and
applications, and they don’t require userspace to adopt anything.

>
> However, I agree with the general principle that the fs should be
> directing the inode mapping tree folio order behaviour. i.e. the
> filesystem already sets both the floor and the desired behaviour for
> folio instantiation for any given inode mapping tree.
>
> It also needs to be able to instantiate large folios -before- the
> executable is mapped into VMAs via mmap() because files can be read
> into cache before they are run (e.g. boot time readahead hacks).
> i.e. a mmap() time directive is too late to apply to the inode
> mapping tree to guarantee optimal layout for PTE optimisation. It
> also may not be possible to apply mmap() time directives due to
> other filesystem constraints, so mmap() time directives may well end
> up being unpredictable and unreliable....
>

ELF loading and the linker may read ahead a small portion of the text
before mmap(). For large executables, however, the few large folios lost
to that limited readahead are probably too minor to be worth worrying
about. "Boot time readahead hacks", on the other hand, look like
something that can read ahead a significant amount.
Unless we can modify these "boot time readahead hacks" to use mmap()
with an EXEC mapping, it seems we would need something in the sys_read()
path to apply the preferred size.

> There's also an obvious filesystem level trigger for enabling this
> behaviour in a generic manner. e.g. The filesystem can look at the
> X perm bits on the inode at instantiation time and if they are set,
> set a "desired order" value+flag on the mapping at inode cache
> instantiation in addition to "min order".
>

I'm not sure what proportion of an executable file the text section
makes up. If it's less than 30% or 50%, we might be allocating
"preferred size" large folios for many other sections that may not
benefit from them. Also, a Bash script with executable permissions
would get the preferred large folio size, which seems odd.

By the way, do .so files have the executable bits set, even though they
may contain a lot of code? Checking my filesystem, it seems not:

/usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
-rw-r--r-- 1 root root 133280 Jan 11 2023 libz.so.1.2.13

> If a desired order is configured, the page cache read code can then
> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
> value to folio allocation. If that can't be allocated then it can
> fall back to single page folios instead of failing.
>
> At this point, we will always optimistically try to allocate larger
> folios for executables on all architectures. Architectures that
> can optimise sequential PTE mappings can then simply add generic
> support for large folio optimisation, and more efficient executable
> mappings simply fall out of the generic support for efficient
> mapping of large folios and filesystems preferring large folios for
> executable inode mappings....

I feel this falls more within the scope of architecture and memory
management than of the filesystem. If possible, shouldn't we try to
avoid modifying the filesystem code?

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Thanks
Barry
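
On the text-proportion question above: one rough way to get a number,
sketched here under stated assumptions and not taken from any patch, is
to sum the file-backed size of the executable PT_LOAD segments and
divide by the file size. The sketch only handles 64-bit little-endian
ELF, it counts everything in executable segments (so .plt and .init as
well as .text), and exec_fraction() is a hypothetical helper name.

```python
import struct
import sys

PT_LOAD = 1   # program header type: loadable segment
PF_X = 0x1    # program header flag: segment is executable


def exec_fraction(data: bytes) -> float:
    """Fraction of the file's bytes that sit in executable PT_LOAD segments.

    Assumption: 64-bit little-endian ELF only; anything else is rejected.
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a 64-bit little-endian ELF file")

    # ELF64 header: e_phoff at offset 0x20, e_phentsize/e_phnum at 0x36.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)

    exec_bytes = 0
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # ELF64 program header prefix: type, flags, offset, vaddr, paddr, filesz.
        p_type, p_flags, _p_offset, _p_vaddr, _p_paddr, p_filesz = (
            struct.unpack_from("<IIQQQQ", data, off)
        )
        if p_type == PT_LOAD and p_flags & PF_X:
            exec_bytes += p_filesz
    return exec_bytes / len(data)


if __name__ == "__main__":
    # Print one line per ELF file given on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(f"{path}: {exec_fraction(f.read()):.1%} in executable segments")
```

Running this over a few binaries and shared libraries from the root
filesystem would show whether the 30% or 50% guess is in the right
range. Note that it works on .so files regardless of their permission
bits, which is exactly the case an X-bit check on the inode would miss.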