From: Ryan Roberts <ryan.roberts@arm.com>
To: Barry Song <21cnbao@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux-MM, Matthew Wilcox, Dave Chinner
Subject: Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
Date: Tue, 1 Apr 2025 12:09:23 +0100
Message-ID: <7ececd6d-90a8-4a05-a759-820801f1a8aa@arm.com>
References: <6201267f-6d3a-4942-9a61-371bd41d633d@arm.com> <3e3a2d12-efcb-44e7-bd03-e8211161f3a4@arm.com>

On 30/03/2025 00:46, Barry Song wrote:
> On Thu, Mar 20, 2025 at 10:57 PM Ryan Roberts wrote:
>>
>> On 19/03/2025 20:47, Barry Song wrote:
>>> On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I know this is very last minute, but I was hoping that it might be possible to
>>>> squeeze in a session to discuss the following?
>>>>
>>>> Summary/Background:
>>>>
>>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>>> regions containing text, current readahead behaviour often yields small,
>>>> misaligned folios, preventing this optimization. This proposal introduces a
>>>> special-case path for executable mappings, performing synchronous reads of an
>>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>>> gains.
>>>>
>>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>>> [2]), but there were objections and the conversation fizzled out. Now that I
>>>> have more compelling performance data, I’m hoping there is stronger
>>>> justification, and we can find a path forwards.
>>>>
>>>> What I’d Like to Cover:
>>>>
>>>> - Describe how text memory should ideally be mapped and why it benefits
>>>>   performance.
>>>>
>>>> - Brief review of performance data.
>>>>
>>>> - Discuss options for the best way to encourage text into large folios:
>>>>   - Let the architecture request a preferred size
>>>>   - Extend VMA attributes to include a preferred THP size hint
>>>
>>> We might need this for a couple of other cases.
>>>
>>> 1. A native heap, for example jemalloc, can configure the base "granularity"
>>> and then use MADV_DONTNEED/FREE at that granularity to manage memory.
>>> Currently, the default granularity is PAGE_SIZE, which can lead to excessive
>>> folio splitting. For instance, if we set jemalloc's granularity to 16KB while
>>> sysfs enables 16KB, 32KB, 64KB, etc., the larger folios can still end up being
>>> split. Therefore, in some cases, I believe the kernel should be aware of how
>>> userspace is managing memory.
>>>
>>> 2. Java heap GC compaction - userfaultfd_move() things.
>>> I am considering adding support for batched PTE/folio moves in
>>> userfaultfd_move(). If sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but the
>>> userspace Java heap moves memory at a 16KB granularity, it could lead to
>>> excessive folio splitting.
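
A minimal user-space sketch of the purge pattern described in point 1, assuming a
4KB base page and a 16KB allocator granule (the granule size and arena layout are
illustrative only, not jemalloc's actual implementation): if the kernel had backed
this arena with a 64KB folio, each such purge forces that folio to be split.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PURGE_GRANULE (16 * 1024)  /* allocator's management granule (illustrative) */

/* Return one fully-free granule to the kernel. If this range sits inside a
 * larger (e.g. 64KB) folio, the kernel must split that folio to honour the
 * request - the overhead being discussed above. */
static void purge_granule(void *granule_start)
{
        if (madvise(granule_start, PURGE_GRANULE, MADV_DONTNEED))
                perror("madvise(MADV_DONTNEED)");
}

int main(void)
{
        /* Stand-in for an allocator arena: one 64KB-aligned, 64KB-sized region. */
        void *arena;
        if (posix_memalign(&arena, 64 * 1024, 64 * 1024))
                return 1;

        /* Suppose only the second 16KB granule is entirely free; purging it in
         * isolation (16KB offset is page-aligned with 4KB pages) is what breaks
         * up a 64KB folio. */
        purge_granule((char *)arena + PURGE_GRANULE);

        free(arena);
        return 0;
}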
>>
>> Would these heaps ever use a 64K granule or is that too big? If they can use
>> 64K, then one simple solution would be to only enable mTHP sizes up to 64K
>> (which is the magic size for arm64).
>>
>
> I'm uncertain how Lokesh plans to implement userfaultfd_move() mTHP support or
> what granularity he'll use in the Java heap GC. However, regarding jemalloc,
> I've found that 64KB is actually too large - it ends up increasing memory
> usage. The issue is that we need at least 64KB of freed small objects before
> we can effectively use MADV_DONTNEED. Perhaps we could try 16KB instead.
>
> The key requirement is that the kernel's maximum large folio size cannot exceed
> the memory management granularity used by userspace heap implementations.
> Before implementing madvise-based per-VMA large folios for the Java heap, I
> plan to first propose a large-folio-aware userfaultfd_move() and discuss this
> approach with Lokesh.

We very briefly discussed this at LSF/MM; Rik van Riel suggested trying to
maintain a per-VMA heuristic capturing the granule that user space is using,
then applying that to the mTHP policy.

>
>> Alternatively they could use MADV_NOHUGEPAGE today and be guaranteed that
>> memory would remain mapped as small folios.
>
> Right, I'm using MADV_NOHUGEPAGE specifically for small size classes in
> jemalloc now, as large folios would soon be split due to unaligned userspace
> heap management.
>
>>
>> But I see the potential problem if you want to benefit from HPA with a 16K
>> granule there but still enable 64K globally. We have briefly discussed the
>> idea of supporting MADV_HUGEPAGE via process_madvise() in the past; that has
>> an extra param that could encode the size hint(s).
>>
>
> I'm not sure what granularity Lokesh plans to support for moving large folios
> in Java GC. But first, we need kernel support for userfaultfd_move() with
> mTHP. Maybe this could serve as a use case to justify the size hint in
> MADV_HUGEPAGE.
>
>>>
>>> For exec, it seems we need a userspace-transparent approach. Asking each
>>> application to modify its code to madvise the kernel on its preferred exec
>>> folio size seems cumbersome.
>>
>> I would much prefer a transparent approach. If we did take the approach of
>> using a per-VMA size hint, I was thinking that could be handled by the
>> dynamic linker. Then it's only one place to update.
>
> The dynamic linker (ld.so) primarily manages the runtime linking of shared
> libraries for executables. However, the initial memory mapping of the
> executable itself (the binary file, e.g., a.out) is performed by the kernel
> during program execution?

Yes, but for dynamically linked executables, the kernel also maps the
interpreter (the linker) and that's the first thing that executes in user
space, so it still has the opportunity to madvise() the main executable
mappings.
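
As a sketch of the "let the dynamic linker apply the hint" idea, assuming a
per-VMA hint delivered via madvise(MADV_HUGEPAGE) were honoured for file-backed
text (which, as noted further down the thread, the page cache does not do
today): an early-running component such as the interpreter could walk the
already-loaded ELF objects and hint their executable segments. The helper below
is an illustration, not an existing linker feature.

#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Walk every loaded ELF object and hint its executable (text) segments. A real
 * dynamic linker would do the equivalent internally, before handing control to
 * the application. */
static int hint_text_segments(struct dl_phdr_info *info, size_t size, void *data)
{
        long page = sysconf(_SC_PAGESIZE);
        (void)size;
        (void)data;

        for (int i = 0; i < info->dlpi_phnum; i++) {
                const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

                if (ph->p_type != PT_LOAD || !(ph->p_flags & PF_X))
                        continue;               /* only executable segments */

                uintptr_t start = info->dlpi_addr + ph->p_vaddr;
                uintptr_t end   = start + ph->p_memsz;

                start &= ~((uintptr_t)page - 1); /* madvise() wants a page-aligned start */

                /* Today the page cache does not act on this for file-backed
                 * text; the sketch assumes a future per-VMA (possibly sized)
                 * hint. */
                if (madvise((void *)start, end - start, MADV_HUGEPAGE))
                        perror("madvise(MADV_HUGEPAGE)");
        }
        return 0;                               /* 0 == keep iterating */
}

int main(void)
{
        dl_iterate_phdr(hint_text_segments, NULL);
        /* ... run the real workload ... */
        return 0;
}

The same walk is where a size hint could be passed, if MADV_HUGEPAGE (or
process_madvise()) ever grew one, which is the option discussed above.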
>
>>
>>>
>>> I mean, we could whitelist all execs by default unless an application
>>> explicitly requests to disable it?
>>
>> I guess the explicit disable would be MADV_NOHUGEPAGE. But I don't believe the
>> pagecache honours this right now; presumably because the memory is shared.
>> What would you do if one process disabled and another didn't?
>
> Correct. My previous concern was that memory-constrained devices could
> experience increased memory pressure due to mandatory 64KB read operations.

Perhaps there is a case for continuing to honour ra->ra_pages (or aligning it
down to a 64K boundary) so that a value of 0 continues to elide readahead?

> A particular concern is that the 64KiB folio remains on the LRU when any
> single subpage is active, whereas smaller folios would have been reclaimable
> when inactive.

We are discussing this in the context of the new post.

>
> However, this appears unrelated to your patch [1]. Perhaps such systems should
> disable file large folios entirely?

There is currently no way to disable large folios for the page cache, other
than to use file systems that don't support large folios yet :). But I agree
that this is unrelated; if there is deemed to be a problem with large folios
and you need a general switch, that's going to be the case irrespective of my
change.

Thanks,
Ryan

>
> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
>
>>
>> Thanks,
>> Ryan
>>
>>>
>>>>   - Provide a sysfs knob
>>>>   - Plug into the “mapping min folio order” infrastructure
>>>>   - Other approaches?
>>>>
>>>> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
>>>> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>>>>
>>>> Thanks,
>>>> Ryan
>>>
>
> Thanks
> Barry
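
To pin down the ra->ra_pages suggestion above, a hypothetical kernel-side
sketch, clamping an arch-preferred text folio order to the per-file readahead
limit; the helper and the hard-coded order are assumptions for illustration,
not the code from [1] or [2].

#include <linux/fs.h>   /* struct file_ra_state */

/*
 * Hypothetical sketch only: pick a folio order for synchronously faulting in
 * executable text, clamped to the per-file readahead limit so that
 * ra_pages == 0 still means "no readahead".
 */
static unsigned int exec_folio_order(const struct file_ra_state *ra)
{
        /* Arch-preferred size; 4 == 64KB with 4KB base pages (arm64 contpte). */
        unsigned int order = 4;

        if (!ra->ra_pages)
                return 0;                       /* honour "readahead disabled" */

        /* Never pull in more pages at once than ra_pages allows. */
        while (order && (1UL << order) > ra->ra_pages)
                order--;

        return order;
}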