linux-mm.kvack.org archive mirror
From: Dev Jain <dev.jain@arm.com>
To: Kalesh Singh <kaleshsingh@google.com>
Cc: lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
	catalin.marinas@arm.com, will@kernel.org, ardb@kernel.org,
	willy@infradead.org, hughd@google.com,
	baolin.wang@linux.alibaba.com, akpm@linux-foundation.org,
	david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org,
	"Mateusz Maćkowski" <mmac@google.com>,
	"Adrian Barnaś" <abarnas@google.com>,
	"Marcin Szymczyk" <marcinszymczyk@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Per-process page size
Date: Thu, 26 Feb 2026 14:15:13 +0530	[thread overview]
Message-ID: <e1858dc5-6526-4aac-8392-f4f2e19da1bc@arm.com> (raw)
In-Reply-To: <CAC_TJvfD19E--wyeVyTmp-LP9ffoLQaUHruZARbdes2EnKgptQ@mail.gmail.com>



On 26/02/26 1:10 pm, Kalesh Singh wrote:
> On Tue, Feb 17, 2026 at 6:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>> Hi everyone,
>>
>> We propose per-process page size on arm64. Although the proposal is for
>> arm64, perhaps the concept can be extended to other arches, thus the
>> generic topic name.
>>
>> -------------
>> INTRODUCTION
>> -------------
>> While mTHP has brought the performance of many workloads running on an arm64 4K
>> kernel closer to that of an arm64 64K kernel, a performance gap still
>> remains. This is attributed to a combination of a greater number of pgtable
>> levels, less reach within the walk cache, and a higher data cache footprint
>> for pgtable memory. At the same time, 64K is not suitable for general
>> purpose environments due to its significantly higher memory footprint.
>>
>> To solve this, we have been experimenting with a concept called "per-process
>> page size". This breaks the historic assumption of a single page size for the
>> entire system: a process will now operate on a page size ABI that is greater
>> than or equal to the kernel's page size. This is enabled by a key architectural
>> feature on Arm: the separation of user and kernel page tables.
>>
>> This can also lead to a future of a single kernel image instead of 4K, 16K
>> and 64K images.
>>
>> --------------
>> CURRENT DESIGN
>> --------------
>> The design is based on one core idea: most of the kernel continues to believe
>> there is only one page size in use across the whole system. That page size is
>> the one selected at compile time, as is done today. But every process (more
>> accurately, every mm_struct) has a page size ABI which is one of the 3 page
>> sizes (4K, 16K or 64K), as long as that page size is greater than or equal
>> to the kernel page size (the kernel page size is the macro PAGE_SIZE).
>>
>> Pagesize selection
>> ------------------
>> A process's selected page size ABI comes into force at execve() time and
>> remains fixed until the process exits or until the next execve(). Forked
>> processes inherit the page size of their parent.
>> The personality() mechanism already exists for similar cases, so we propose
>> extending it to allow specifying the required page size.
>>
>> There are 3 layers to the design. The first two are not arch-dependent,
>> and make Linux support a per-process pagesize ABI. The last layer is
>> arch-specific.
>>
>> 1. ABI adapter
>> --------------
>> A translation layer is added at the syscall boundary to convert between the
>> process page size and the kernel page size. This effectively means enforcing
>> alignment requirements for addresses passed to syscalls and ensuring that
>> quantities passed as “number of pages” are interpreted relative to the process
>> page size and not the kernel page size. In this way the process has the illusion
>> that it is working in units of its page size, but the kernel is working in
>> units of the kernel page size.
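A minimal sketch of the kind of checks and conversions this layer performs, assuming the process page size is available to the adapter as proc_size (the helper names are ours, not an existing kernel API):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reject a fixed mmap() address that is not aligned to the process
 * page size, even if it is aligned to the (smaller) kernel PAGE_SIZE. */
static bool abi_addr_ok(uint64_t addr, size_t proc_size)
{
	return (addr & (proc_size - 1)) == 0;
}

/* Round a userspace length up to whole process pages, so the kernel
 * always operates in process-page-size units behind the scenes. */
static size_t abi_round_len(size_t len, size_t proc_size)
{
	return (len + proc_size - 1) & ~(proc_size - 1);
}
```

The same pattern applies anywhere a syscall takes an address, a length, or a "number of pages": validate and round against proc_size at the boundary, then let the rest of the kernel see ordinary kernel-page quantities.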
>>
>> 2. Generic Linux MM enlightenment
>> ---------------------------------
>> We enlighten the Linux MM code to always hand out memory in the granularity
>> of process pages. Most of this work is greatly simplified because of the
>> existing mTHP allocation paths, and the ongoing support for large folios
>> across different areas of the kernel. The process order will be used as the
>> hard minimum mTHP order to allocate.
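For concreteness, the process order is just the log2 ratio of the process page size to the kernel page size (the helper below is illustrative):

```c
#include <stddef.h>

/* The process order: how many kernel pages (as a power of two) make
 * up one process page. E.g. 64K over a 4K kernel -> order 4 (16 pages). */
static int process_order(size_t proc_size, size_t kernel_size)
{
	int order = 0;

	while ((kernel_size << order) < proc_size)
		order++;
	return order;
}
```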
>>
>> File memory
>> -----------
>> For a growing list of compliant file systems, large folios can already be
>> stored in the page cache. There is even a mechanism, introduced to support
>> filesystems with block sizes larger than the system page size, to set a
>> hard-minimum size for folios on a per-address-space basis. This mechanism
>> will be reused and extended to service the per-process page size requirements.
>>
>> One key reason that the 64K kernel currently consumes considerably more memory
>> than the 4K kernel is that Linux systems often have lots of small
>> configuration files which each require a page in the page cache. But these
>> small files are (likely) only used by certain processes. So, we prefer to
>> continue to cache those using a 4K page.
>> Therefore, if a process with a larger page size maps a file whose pagecache
>> contains smaller folios, we drop them and re-read the range with a folio
>> order at least that of the process order.
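The resulting minimum-order policy can be sketched as follows (helper names are ours; the real kernel mechanism is the per-address-space minimum folio order mentioned above):

```c
#include <stdbool.h>

/* Raise the mapping's folio-order floor to the largest process order
 * among its mappers; small-page-only processes keep small folios. */
static int new_mapping_min_order(int cur_min_order, int proc_order)
{
	return cur_min_order > proc_order ? cur_min_order : proc_order;
}

/* Existing pagecache must be dropped and re-read iff it holds folios
 * below the order the mapping process requires. */
static bool needs_reread(int cur_min_order, int proc_order)
{
	return cur_min_order < proc_order;
}
```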
>>
>> 3. Translation from Linux pagetable to native pagetable
>> -------------------------------------------------------
>> Assume the case of a kernel pagesize of 4K and app pagesize of 64K.
>> Now that enlightenment is done, it is guaranteed that every single mapping
>> in the 4K pagetable (which we call the Linux pagetable) is of granularity
>> at least 64K. In the arm64 MM code, we maintain a "native" pagetable per
>> mm_struct, which is based on a 64K geometry. Because of the aforementioned
>> guarantee, any pagetable operation on the Linux pagetable
>> (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc.) happens
>> at a granularity of at least 16 PTEs - therefore we can translate the
>> operation into a modification of a single PTE entry in the native pagetable.
>> Given that enlightenment may miss corner cases, we insert a warning in the
>> architecture code - on being presented with an operation that cannot be
>> translated into a native operation, we fall back to the Linux pagetable,
>> thus losing the benefits of the pagetable geometry but keeping the
>> emulation intact.
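To make the translation concrete, here is a simplified model of the translatability check for the 4K-Linux/64K-native case (real PTEs also carry permission and access/dirty bits, which are elided here):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINUX_PAGE_SHIFT  12	/* 4K Linux (kernel) pages */
#define NATIVE_PAGE_SHIFT 16	/* 64K native geometry */
#define PTES_PER_NATIVE   (1 << (NATIVE_PAGE_SHIFT - LINUX_PAGE_SHIFT)) /* 16 */

/* A Linux-side set_ptes() spanning one native page is translatable
 * iff it covers exactly one native page, starts native-aligned, and
 * the 4K PFNs are physically contiguous. */
static bool translatable(uint64_t addr, const uint64_t *pfns, int nr)
{
	if (nr != PTES_PER_NATIVE)
		return false;
	if (addr & ((1ULL << NATIVE_PAGE_SHIFT) - 1))
		return false;
	for (int i = 1; i < nr; i++)
		if (pfns[i] != pfns[0] + i)
			return false;
	return true;
}

/* The single native PTE maps the same physical range at 64K granularity. */
static uint64_t native_pfn(const uint64_t *pfns)
{
	return pfns[0] >> (NATIVE_PAGE_SHIFT - LINUX_PAGE_SHIFT);
}
```

When translatable() fails, that is the fallback path described above: keep walking the Linux pagetable and emit a warning.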
>>
>> -----------------------
>> What we want to discuss
>> -----------------------
>>  - Are there other arches which could benefit from this?
>>  - What level of compatibility can we achieve - is it even possible to
>>    contain userspace within the emulated ABI?
>>  - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For
>>    example, what happens when a 64K process opens a procfs file of
>>    a 4K process?
>>  - native pgtable implementation - perhaps inspiration can be taken
>>    from other arches with an involved pgtable logic (ppc, s390)?
>>
> 
> Hi Dev, Ryan,
> 
> I'd be very interested in joining this discussion at LSF/MM.

Thanks Kalesh for your interest!

> 
> On Android, we have a separate but very related use case: we emulate a
> larger userspace page size on x86, primarily to allow app developers
> to test their apps for 16KB compatibility using x86 emulators [1].
> 
> Similar to your proposed "ABI adapter" layer, our approach works by
> enforcing a larger 16KB granularity and alignment on the VMAs to
> emulate the userspace page size, while the underlying kernel still
> operates on a 4KB granularity [2].
> 
> In our emulation experience, we've run into a few specific rough edges:
> 
> 1. mmap and SIGBUS: Enforcing a larger VMA granularity means that
> mapping files can easily extend the VMA beyond the end of the file's
> valid offset. When userspace touches this padded area, the 4KB filemap
> fault cannot resolve to a valid index, resulting in a SIGBUS that
> applications aren't expecting.

You did mention the links below in the other email, and I went ahead
and compared :) I was puzzled to see some sort of VMA padding approach
in your patches. OTOH, our approach pads with anonymous pages. So, for
example, if a 64K process maps a 12K file, we will map 52K/4K = 13
anonymous pages into the 64K-aligned VMA.

Implementation-wise, we detect this condition in filemap_fault,
return VM_FAULT_NEED_ANONPAGE, and redirect it to do_anonymous_page
to map 4K pages.
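The padding arithmetic generalizes as follows (a standalone sketch; the helper name is ours):

```c
#include <stddef.h>

/* Number of kernel-size anonymous pages needed to pad a mapped file
 * of file_size bytes out to a VMA rounded up to the process page
 * size. E.g. a 12K file under a 64K process page: (64K - 12K)/4K = 13. */
static size_t anon_pad_pages(size_t file_size, size_t proc_size,
			     size_t kernel_size)
{
	size_t vma_len  = (file_size + proc_size - 1) & ~(proc_size - 1);
	size_t file_len = (file_size + kernel_size - 1) & ~(kernel_size - 1);

	return (vma_len - file_len) / kernel_size;
}
```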

> 
> 2. userfaultfd: This inherently operates at the strict PTE granularity
> of the underlying kernel (4KB). Hiding this from a userspace that
> expects a 16KB/64KB fault granularity while the kernel still operates
> on 4KB granularity is messy ...

Indeed. We will have to fault in 16 4K pages.
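A rough model of that fault-widening (populate_one stands in for whatever resolves a single kernel-size page; all names here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

static int populated;

static int count_populate(uint64_t addr)
{
	(void)addr;
	populated++;
	return 0;
}

/* Resolve one process-page fault by populating every kernel page it
 * covers, so userspace never observes sub-process-page granularity. */
static int fault_in_process_page(uint64_t fault_addr, size_t proc_size,
				 size_t kernel_size,
				 int (*populate_one)(uint64_t))
{
	uint64_t base = fault_addr & ~(uint64_t)(proc_size - 1);

	for (size_t off = 0; off < proc_size; off += kernel_size) {
		int ret = populate_one(base + off);

		if (ret)
			return ret;
	}
	return 0;
}
```

For a 64K process page on a 4K kernel this resolves 16 kernel pages per fault; the messy part is reporting it to the uffd reader as one 64K event.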

> 
> 3. pagemap and PFN interfaces: As you noted with procfs, interfaces
> that expose or consume PFNs are problematic. Userspace tools reading
> /proc/pid/pagemap, /proc/kpagecount, /proc/kpageflags,
> /proc/kpagecgroup, and /sys/kernel/mm/page_idle/bitmap calculate
> offsets based on the userspace page size ABI, but the kernel returns
> 4KB PFNs which breaks such users.
> 
> 
> It would be great to explore if we can align on a unified approach to
> solve these.
> 
> [1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator
> [2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic
> 
> Thanks,
> Kalesh
> 
>> -------------
>> Key Attendees
>> -------------
>>  - Ryan Roberts (co-presenter)
>>  - mm folks (David Hildenbrand, Matthew Wilcox, Liam Howlett, Lorenzo Stoakes,
>>              and many others)
>>  - arch folks
>>



Thread overview: 20+ messages
2026-02-17 14:50 Dev Jain
2026-02-17 15:22 ` Matthew Wilcox
2026-02-17 15:30   ` David Hildenbrand (Arm)
2026-02-17 15:51     ` Ryan Roberts
2026-02-20  4:49     ` Matthew Wilcox
2026-02-20 16:50       ` David Hildenbrand (Arm)
2026-02-23 13:02         ` [Lsf-pc] " Jan Kara
2026-02-18  8:39   ` Dev Jain
2026-02-18  8:58     ` Dev Jain
2026-02-18  9:15       ` David Hildenbrand (Arm)
2026-02-20  9:49   ` Arnd Bergmann
2026-02-20 13:37 ` Pedro Falcato
2026-02-23  5:07   ` Dev Jain
2026-02-23 12:49     ` Pedro Falcato
2026-02-23 13:01       ` David Hildenbrand (Arm)
2026-02-23 15:18     ` Matthew Wilcox
2026-02-23 16:28       ` David Hildenbrand (Arm)
2026-02-24  4:32         ` Dev Jain
2026-02-26  7:40 ` Kalesh Singh
2026-02-26  8:45   ` Dev Jain [this message]
