From: Ryan Roberts
To: Barry Song <21cnbao@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux-MM, Matthew Wilcox, Dave Chinner
Subject: Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
Date: Thu, 20 Mar 2025 14:57:38 +0000
Message-ID: <3e3a2d12-efcb-44e7-bd03-e8211161f3a4@arm.com>
References: <6201267f-6d3a-4942-9a61-371bd41d633d@arm.com>

On 19/03/2025 20:47, Barry Song wrote:
> On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts wrote:
>>
>> Hi All,
>>
>> I know this is very last minute, but I was hoping that it might be possible to
>> squeeze in a session to discuss the following?
>>
>> Summary/Background:
>>
>> On arm64, physically contiguous and naturally aligned regions can take advantage
>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure.
>> However, for file regions containing text, current readahead behaviour often
>> yields small, misaligned folios, preventing this optimization. This proposal
>> introduces a special-case path for executable mappings, performing synchronous
>> reads of an architecture-chosen size into large folios (64 KB on arm64). Early
>> performance tests on real-world workloads (e.g. nginx, redis, kernel
>> compilation) show ~2-9% gains.
>>
>> I’ve previously posted attempts to enable this performance improvement ([1],
>> [2]), but there were objections and conversation fizzled out. Now that I have
>> more compelling performance data, I’m hoping there is now stronger
>> justification, and we can find a path forwards.
>>
>> What I’d Like to Cover:
>>
>>  - Describe how text memory should ideally be mapped and why it benefits
>>    performance.
>>
>>  - Brief review of performance data.
>>
>>  - Discuss options for the best way to encourage text into large folios:
>>    - Let the architecture request a preferred size
>>    - Extend VMA attributes to include preferred THP size hint
>
> We might need this for a couple of other cases.
>
> 1. The native heap (for example, a native heap like jemalloc) can configure
> the base "granularity" and then use MADV_DONTNEED/FREE at that granularity
> to manage memory. Currently, the default granularity is PAGE_SIZE, which can
> lead to excessive folio splitting. For instance, if we set jemalloc's
> granularity to 16KB while sysfs supports 16KB, 32KB, 64KB, etc., splitting
> can still occur. Therefore, in some cases, I believe the kernel should be
> aware of how userspace is managing memory.
>
> 2. Java heap GC compaction - userfaultfd_move() things.
> I am considering adding support for batched PTE/folios moves in
> userfaultfd_move().
> If sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but the userspace Java
> heap moves memory at a 16KB granularity, it could lead to excessive folio
> splitting.

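To make the splitting concern concrete, here is a rough sketch in Python. It is a toy model, not kernel code: it assumes memory is tiled by naturally aligned folios of a single size, and it treats any folio that is only partly covered by a MADV_DONTNEED-style range as one that would have to be split. The function name and the model are assumptions made purely for illustration.

```python
KB = 1024


def partially_covered_folios(start, length, folio_size):
    """Return base offsets of folios that [start, start + length) only partly covers.

    Toy model (an assumption, not kernel behaviour): memory is tiled by
    naturally aligned folios of one size. All values are in bytes.
    """
    end = start + length
    first = (start // folio_size) * folio_size      # folio holding the first byte
    last = ((end - 1) // folio_size) * folio_size   # folio holding the last byte
    partial = []
    for base in range(first, last + folio_size, folio_size):
        overlap = min(end, base + folio_size) - max(start, base)
        if overlap < folio_size:
            partial.append(base)
    return partial


if __name__ == "__main__":
    # A heap frees one 16K granule at offset 16K, for several folio sizes.
    for folio_kb in (16, 32, 64):
        hit = partially_covered_folios(16 * KB, 16 * KB, folio_kb * KB)
        print(f"{folio_kb}K folios: {len(hit)} partially covered")
```

In this model a granule-aligned, granule-sized range never partly covers a folio that is no larger than the granule, so only folio sizes above the heap's granule are at risk.
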
Would these heaps ever use a 64K granule or is that too big? If they can use 64K,
then one simple solution would be to only enable mTHP sizes up to 64K (which is
the magic size for arm64).

Alternatively they could use MADV_NOHUGEPAGE today and be guaranteed that memory
would remain mapped as small folios.

But I see the potential problem if you want to benefit from HPA with 16K granule
there but still enable 64K globally. We have briefly discussed the idea of
supporting MADV_HUGEPAGE via process_madvise() in the past; that has an extra
param that could encode the size hint(s).

>
> For exec, it seems we need a userspace-transparent approach. Asking each
> application to modify its code to madvise the kernel on its preferred exec folio
> size seems cumbersome.

I would much prefer a transparent approach. If we did take the approach of using
a per-VMA size hint, I was thinking that could be handled by the dynamic linker.
Then it's only one place to update.

>
> I mean, we could whitelist all execs by default unless an application explicitly
> requests to disable it?

I guess the explicit disable would be MADV_NOHUGEPAGE. But I don't believe the
pagecache honours this right now; presumably because the memory is shared. What
would you do if one process disabled and another didn't?

Thanks,
Ryan

>
>>    - Provide a sysfs knob
>>    - Plug into the “mapping min folio order” infrastructure
>>    - Other approaches?
>>
>> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>
> Thanks
> Barry
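
P.S. For the option of capping the enabled mTHP sizes at 64K, a minimal sketch of what that policy could look like from userspace. The helper name, the example size list and the choice of "always" versus "never" are assumptions for illustration ("madvise" or "inherit" would be equally valid values); the per-size `enabled` files are the ones described in the kernel's transparent hugepage admin guide. The script only prints the writes it would make and does not touch sysfs.

```python
# Per-size control file, as described in the kernel's transhuge admin guide.
SYSFS_FMT = "/sys/kernel/mm/transparent_hugepage/hugepages-{kb}kB/enabled"


def mthp_settings(available_kb, cap_kb=64):
    """Map each size's sysfs 'enabled' file to the value this policy would write.

    Hypothetical helper: sizes up to cap_kb get "always", larger ones "never".
    """
    return {
        SYSFS_FMT.format(kb=kb): ("always" if kb <= cap_kb else "never")
        for kb in available_kb
    }


if __name__ == "__main__":
    # Example size list only; the sizes actually offered depend on the system.
    for path, value in mthp_settings([16, 32, 64, 128, 2048]).items():
        print(f"echo {value} > {path}")
```
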