From: Kalesh Singh <kaleshsingh@google.com>
To: Dev Jain <dev.jain@arm.com>
Cc: lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
catalin.marinas@arm.com, will@kernel.org, ardb@kernel.org,
willy@infradead.org, hughd@google.com,
baolin.wang@linux.alibaba.com, akpm@linux-foundation.org,
david@kernel.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org,
"Mateusz Maćkowski" <mmac@google.com>,
"Adrian Barnaś" <abarnas@google.com>,
"Marcin Szymczyk" <marcinszymczyk@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Per-process page size
Date: Thu, 26 Feb 2026 21:11:22 -0800 [thread overview]
Message-ID: <CAC_TJvcvybHqVAV8nAHEvN-UXUQ5hMjZx+_b2W3MY=xgqR9=6w@mail.gmail.com> (raw)
In-Reply-To: <e1858dc5-6526-4aac-8392-f4f2e19da1bc@arm.com>
On Thu, Feb 26, 2026 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 26/02/26 1:10 pm, Kalesh Singh wrote:
> > On Tue, Feb 17, 2026 at 6:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> We propose per-process page size on arm64. Although the proposal is for
> >> arm64, perhaps the concept can be extended to other arches, thus the
> >> generic topic name.
> >>
> >> -------------
> >> INTRODUCTION
> >> -------------
> >> While mTHP has brought the performance of many workloads running on an arm64 4K
> >> kernel closer to that of the performance on an arm64 64K kernel, a performance
> >> gap still remains. This is attributed to a combination of a greater number of
> >> pgtable levels, less reach within the walk cache, and a higher data cache
> >> footprint for pgtable memory. At the same time, 64K is not suitable for
> >> general-purpose environments due to its significantly higher memory footprint.
> >>
> >> To solve this, we have been experimenting with a concept called "per-process
> >> page size". This breaks the historic assumption of a single page size for the
> >> entire system: a process will now operate on a page size ABI that is greater
> >> than or equal to the kernel's page size. This is enabled by a key architectural
> >> feature on Arm: the separation of user and kernel page tables.
> >>
> >> This can also lead to a future with a single kernel image instead of separate
> >> 4K, 16K and 64K images.
> >>
> >> --------------
> >> CURRENT DESIGN
> >> --------------
> >> The design is based on one core idea: most of the kernel continues to believe
> >> there is only one page size in use across the whole system. That page size is
> >> the size selected at compile-time, as is done today. But every process (more
> >> accurately mm_struct) has a page size ABI which is one of the 3 page sizes
> >> (4K, 16K or 64K) as long as that page size is greater than or equal to the
> >> kernel page size (kernel page size is the macro PAGE_SIZE).
> >>
> >> Pagesize selection
> >> ------------------
> >> A process' selected page size ABI comes into force at execve() time and
> >> remains fixed until the process exits or until the next execve(). Any forked
> >> processes inherit the page size of their parent.
> >> The personality() mechanism already exists for similar cases, so we propose
> >> to extend it to enable specifying the required page size.
> >>
> >> There are 3 layers to the design. The first two are not arch-dependent,
> >> and make Linux support a per-process pagesize ABI. The last layer is
> >> arch-specific.
> >>
> >> 1. ABI adapter
> >> --------------
> >> A translation layer is added at the syscall boundary to convert between the
> >> process page size and the kernel page size. This effectively means enforcing
> >> alignment requirements for addresses passed to syscalls and ensuring that
> >> quantities passed as “number of pages” are interpreted relative to the process
> >> page size and not the kernel page size. In this way the process has the illusion
> >> that it is working in units of its page size, but the kernel is working in
> >> units of the kernel page size.
> >>
> >> 2. Generic Linux MM enlightenment
> >> ---------------------------------
> >> We enlighten the Linux MM code to always hand out memory in the granularity
> >> of process pages. Most of this work is greatly simplified because of the
> >> existing mTHP allocation paths, and the ongoing support for large folios
> >> across different areas of the kernel. The process order will be used as the
> >> hard minimum mTHP order to allocate.
> >>
> >> File memory
> >> -----------
> >> For a growing list of compliant file systems, large folios can already be
> >> stored in the page cache. There is even a mechanism, introduced to support
> >> filesystems with block sizes larger than the system page size, to set a
> >> hard-minimum size for folios on a per-address-space basis. This mechanism
> >> will be reused and extended to service the per-process page size requirements.
> >>
> >> One key reason that the 64K kernel currently consumes considerably more memory
> >> than the 4K kernel is that Linux systems often have many small
> >> configuration files, each of which requires a page in the page cache. These
> >> small files are (likely) only used by certain processes, so we prefer to
> >> continue caching them in 4K pages.
> >> Therefore, if a process with a larger page size maps a file whose pagecache
> >> contains smaller folios, we drop them and re-read the range with a folio
> >> order at least that of the process order.
> >>
> >> 3. Translation from Linux pagetable to native pagetable
> >> -------------------------------------------------------
> >> Assume the case of a kernel pagesize of 4K and app pagesize of 64K.
> >> Now that enlightenment is done, it is guaranteed that every single mapping
> >> in the 4K pagetable (which we call the Linux pagetable) is of granularity
> >> at least 64K. In the arm64 MM code, we maintain a "native" pagetable per
> >> mm_struct, which is based on a 64K geometry. Because of the aforementioned
> >> guarantee, any pagetable operation on the Linux pagetable
> >> (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc.) is going to happen
> >> at a granularity of at least 16 PTEs - therefore we can translate this
> >> operation to modify a single PTE entry in the native pagetable.
> >> Given that enlightenment may miss corner cases, we insert a warning in the
> >> architecture code: when presented with an operation that cannot be translated
> >> into a native operation, we fall back to the Linux pagetable, thus losing
> >> the benefits of the native pagetable geometry but keeping
> >> the emulation intact.
> >>
> >> -----------------------
> >> What we want to discuss
> >> -----------------------
> >> - Are there other arches which could benefit from this?
> >> - What level of compatibility can we achieve - is it even possible to
> >> contain userspace within the emulated ABI?
> >> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For
> >> example, what happens when a 64K process opens a procfs file of
> >> a 4K process?
> >> - native pgtable implementation - perhaps inspiration can be taken
> >> from other arches with an involved pgtable logic (ppc, s390)?
> >>
> >
> > Hi Dev, Ryan,
> >
> > I'd be very interested in joining this discussion at LSF/MM.
>
> Thanks Kalesh for your interest!
>
> >
> > On Android, we have a separate but very related use case: we emulate a
> > larger userspace page size on x86, primarily to allow app developers
> > to test their apps for 16KB compatibility using x86 emulators [1].
> >
> > Similar to your proposed "ABI adapter" layer, our approach works by
> > enforcing a larger 16KB granularity and alignment on the VMAs to
> > emulate the userspace page size, while the underlying kernel still
> > operates on a 4KB granularity [2].
> >
> > In our emulation experience, we've run into a few specific rough edges:
> >
> > 1. mmap and SIGBUS: Enforcing a larger VMA granularity means that
> > mapping files can easily extend the VMA beyond the end of the file's
> > valid offset. When userspace touches this padded area, the 4KB filemap
> > fault cannot resolve to a valid index, resulting in a SIGBUS that
> > applications aren't expecting.
>
> You did mention the links below in the other email, and I went ahead
> and compared :) I was puzzled to see some sort of VMA padding approach
> in your patches. OTOH our approach pads with anonymous pages. So for example,
> if a 64K process maps a 12K sized file, we will map 52K/4K = 13 anonymous
> pages into the 64K-aligned VMA.
>
> Implementation-wise, we detect such a condition in filemap_fault
> and return VM_FAULT_NEED_ANONPAGE, and redirect that to do_anonymous_page
> to map 4K pages.
Ah, the VMA padding patches you saw are actually for a different feature.
To handle the file mapping overhang, we currently insert a separate
anonymous VMA to cover the remainder of the emulated page range. Though
I think your approach of returning VM_FAULT_NEED_ANONPAGE to fault in
anonymous pages without needing to manage extra VMAs is a much cleaner
design :)
Thanks,
Kalesh
>
> >
> > 2. userfaultfd: This inherently operates at the strict PTE granularity
> > of the underlying kernel (4KB). Hiding this from a userspace that
> > expects a 16KB/64KB fault granularity while the kernel still operates
> > on 4KB granularity is messy ...
>
> Indeed. We will have to fault in 16 4K pages.
>
> >
> > 3. pagemap and PFN interfaces: As you noted with procfs, interfaces
> > that expose or consume PFNs are problematic. Userspace tools reading
> > /proc/pid/pagemap, /proc/kpagecount, /proc/kpageflags,
> > /proc/kpagecgroup, and /sys/kernel/mm/page_idle/bitmap calculate
> > offsets based on the userspace page size ABI, but the kernel returns
> > 4KB PFNs which breaks such users.
> >
> >
> > It would be great to explore if we can align on a unified approach to
> > solve these.
> >
> > [1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator
> > [2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic
> >
> > Thanks,
> > Kalesh
> >
> >> -------------
> >> Key Attendees
> >> -------------
> >> - Ryan Roberts (co-presenter)
> >> - mm folks (David Hildenbrand, Matthew Wilcox, Liam Howlett, Lorenzo Stoakes,
> >> and many others)
> >> - arch folks
> >>
>
Thread overview: 21+ messages [newest: ~2026-02-27 5:11 UTC]
2026-02-17 14:50 Dev Jain
2026-02-17 15:22 ` Matthew Wilcox
2026-02-17 15:30 ` David Hildenbrand (Arm)
2026-02-17 15:51 ` Ryan Roberts
2026-02-20 4:49 ` Matthew Wilcox
2026-02-20 16:50 ` David Hildenbrand (Arm)
2026-02-23 13:02 ` [Lsf-pc] " Jan Kara
2026-02-18 8:39 ` Dev Jain
2026-02-18 8:58 ` Dev Jain
2026-02-18 9:15 ` David Hildenbrand (Arm)
2026-02-20 9:49 ` Arnd Bergmann
2026-02-20 13:37 ` Pedro Falcato
2026-02-23 5:07 ` Dev Jain
2026-02-23 12:49 ` Pedro Falcato
2026-02-23 13:01 ` David Hildenbrand (Arm)
2026-02-23 15:18 ` Matthew Wilcox
2026-02-23 16:28 ` David Hildenbrand (Arm)
2026-02-24 4:32 ` Dev Jain
2026-02-26 7:40 ` Kalesh Singh
2026-02-26 8:45 ` Dev Jain
2026-02-27 5:11 ` Kalesh Singh [this message]