* [LSF/MM/BPF TOPIC] Per-process page size
@ 2026-02-17 14:50 Dev Jain
2026-02-17 15:22 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Dev Jain @ 2026-02-17 14:50 UTC (permalink / raw)
To: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd,
baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel
Cc: Dev Jain
Hi everyone,
We propose per-process page size on arm64. Although the proposal is for
arm64, perhaps the concept can be extended to other arches, thus the
generic topic name.
-------------
INTRODUCTION
-------------
While mTHP has brought the performance of many workloads running on an arm64 4K
kernel closer to that of an arm64 64K kernel, a performance gap still remains.
This is attributed to a combination of a greater number of pgtable levels, less
reach within the walk cache, and a higher data cache footprint for pgtable
memory. At the same time, 64K is not suitable for general-purpose environments
due to its significantly higher memory footprint.
To solve this, we have been experimenting with a concept called "per-process
page size". This breaks the historic assumption of a single page size for the
entire system: a process will now operate on a page size ABI that is greater
than or equal to the kernel's page size. This is enabled by a key architectural
feature on Arm: the separation of user and kernel page tables.
This can also lead to a future of a single kernel image instead of 4K, 16K
and 64K images.
--------------
CURRENT DESIGN
--------------
The design is based on one core idea: most of the kernel continues to believe
there is only one page size in use across the whole system. That page size is
the size selected at compile-time, as is done today. But every process (more
accurately mm_struct) has a page size ABI which is one of the 3 page sizes
(4K, 16K or 64K) as long as that page size is greater than or equal to the
kernel page size (kernel page size is the macro PAGE_SIZE).
Pagesize selection
------------------
A process' selected page size ABI comes into force at execve() time and
remains fixed until the process exits or until the next execve(). Any forked
processes inherit the page size of their parent.
The personality() mechanism already exists for similar cases, so we propose
to extend it to enable specifying the required page size.
There are three layers to the design. The first two are not arch-dependent
and make Linux support a per-process pagesize ABI. The last layer is
arch-specific.
1. ABI adapter
--------------
A translation layer is added at the syscall boundary to convert between the
process page size and the kernel page size. This effectively means enforcing
alignment requirements for addresses passed to syscalls and ensuring that
quantities passed as “number of pages” are interpreted relative to the process
page size and not the kernel page size. In this way the process has the illusion
that it is working in units of its page size, but the kernel is working in
units of the kernel page size.
2. Generic Linux MM enlightenment
---------------------------------
We enlighten the Linux MM code to always hand out memory in the granularity
of process pages. Most of this work is greatly simplified because of the
existing mTHP allocation paths, and the ongoing support for large folios
across different areas of the kernel. The process order will be used as the
hard minimum mTHP order to allocate.
File memory
-----------
For a growing list of compliant file systems, large folios can already be
stored in the page cache. There is even a mechanism, introduced to support
filesystems with block sizes larger than the system page size, to set a
hard-minimum size for folios on a per-address-space basis. This mechanism
will be reused and extended to service the per-process page size requirements.
One key reason that the 64K kernel currently consumes considerably more memory
than the 4K kernel is that Linux systems often have lots of small
configuration files which each require a page in the page cache. But these
small files are (likely) only used by certain processes. So, we prefer to
continue to cache those using a 4K page.
Therefore, if a process with a larger page size maps a file whose pagecache
contains smaller folios, we drop them and re-read the range with a folio
order at least that of the process order.
3. Translation from Linux pagetable to native pagetable
-------------------------------------------------------
Assume the case of a kernel pagesize of 4K and app pagesize of 64K.
Now that enlightenment is done, it is guaranteed that every single mapping
in the 4K pagetable (which we call the Linux pagetable) is of granularity
at least 64K. In the arm64 MM code, we maintain a "native" pagetable per
mm_struct, which is based on a 64K geometry. Because of this guarantee,
any pagetable operation on the Linux pagetable
(set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen
at a granularity of at least 16 PTEs - therefore we can translate this
operation to modify a single PTE entry in the native pagetable.
Given that enlightenment may miss corner cases, we insert a warning in the
architecture code - on being presented with an operation not translatable
into a native operation, we fall back to the Linux pagetable, thus losing
the benefits borne out of the pagetable geometry but keeping
the emulation intact.
-----------------------
What we want to discuss
-----------------------
- Are there other arches which could benefit from this?
- What level of compatibility can we achieve - is it even possible to
contain userspace within the emulated ABI?
- Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For
example, what happens when a 64K process opens a procfs file of
a 4K process?
- native pgtable implementation - perhaps inspiration can be taken
from other arches with an involved pgtable logic (ppc, s390)?
-------------
Key Attendees
-------------
- Ryan Roberts (co-presenter)
- mm folks (David Hildenbrand, Matthew Wilcox, Liam Howlett, Lorenzo Stoakes,
and many others)
- arch folks
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Per-process page size
2026-02-17 14:50 [LSF/MM/BPF TOPIC] Per-process page size Dev Jain
@ 2026-02-17 15:22 ` Matthew Wilcox
2026-02-17 15:30 ` David Hildenbrand (Arm)
` (2 more replies)
2026-02-20 13:37 ` Pedro Falcato
2026-02-26 7:40 ` Kalesh Singh
2 siblings, 3 replies; 21+ messages in thread
From: Matthew Wilcox @ 2026-02-17 15:22 UTC (permalink / raw)
To: Dev Jain
Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd,
baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel

On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote:
> 2. Generic Linux MM enlightenment
> ---------------------------------
> We enlighten the Linux MM code to always hand out memory in the granularity

Please don't use the term "enlighten". That's used to describe something
or other with hypervisors. Come up with a new term or use one that
already exists.

> File memory
> -----------
> For a growing list of compliant file systems, large folios can already be
> stored in the page cache. There is even a mechanism, introduced to support
> filesystems with block sizes larger than the system page size, to set a
> hard-minimum size for folios on a per-address-space basis. This mechanism
> will be reused and extended to service the per-process page size requirements.
>
> One key reason that the 64K kernel currently consumes considerably more memory
> than the 4K kernel is that Linux systems often have lots of small
> configuration files which each require a page in the page cache. But these
> small files are (likely) only used by certain processes. So, we prefer to
> continue to cache those using a 4K page.
> Therefore, if a process with a larger page size maps a file whose pagecache
> contains smaller folios, we drop them and re-read the range with a folio
> order at least that of the process order.

That's going to be messy.
I don't have a good idea for solving this
problem, but the page cache really isn't set up to change minimum folio
order while the inode is in use.

> - Are there other arches which could benefit from this?

Some architectures walk the page tables entirely in software, but on the
other hand, those tend to be, er, "legacy" architectures these days and
it's doubtful that anybody would invest in adding support.

Sounds like a good question for Arnd ;-)

> - What level of compatibility we can achieve - is it even possible to
> contain userspace within the emulated ABI?
> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For
> example, what happens when a 64K process opens a procfs file of
> a 4K process?
> - native pgtable implementation - perhaps inspiration can be taken
> from other arches with an involved pgtable logic (ppc, s390)?

I question who decides what page size a particular process will use.
The programmer? The sysadmin? It seems too disruptive for the kernel
to monitor and decide for the app what page size it will use.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 15:22 ` Matthew Wilcox @ 2026-02-17 15:30 ` David Hildenbrand (Arm) 2026-02-17 15:51 ` Ryan Roberts 2026-02-20 4:49 ` Matthew Wilcox 2026-02-18 8:39 ` Dev Jain 2026-02-20 9:49 ` Arnd Bergmann 2 siblings, 2 replies; 21+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-17 15:30 UTC (permalink / raw) To: Matthew Wilcox, Dev Jain Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 2/17/26 16:22, Matthew Wilcox wrote: > On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote: >> 2. Generic Linux MM enlightenment >> --------------------------------- >> We enlighten the Linux MM code to always hand out memory in the granularity > > Please don't use the term "enlighten". Tht's used to describe something > something or other with hypervisors. Come up with a new term or use one > that already exists. > >> File memory >> ----------- >> For a growing list of compliant file systems, large folios can already be >> stored in the page cache. There is even a mechanism, introduced to support >> filesystems with block sizes larger than the system page size, to set a >> hard-minimum size for folios on a per-address-space basis. This mechanism >> will be reused and extended to service the per-process page size requirements. >> >> One key reason that the 64K kernel currently consumes considerably more memory >> than the 4K kernel is that Linux systems often have lots of small >> configuration files which each require a page in the page cache. But these >> small files are (likely) only used by certain processes. So, we prefer to >> continue to cache those using a 4K page. >> Therefore, if a process with a larger page size maps a file whose pagecache >> contains smaller folios, we drop them and re-read the range with a folio >> order at least that of the process order. 
> > That's going to be messy. I don't have a good idea for solving this
> > problem, but the page cache really isn't set up to change minimum folio
> > order while the inode is in use.

In a private conversation I also raised that some situations might make it
impossible/hard to drop+re-read.

One example I came up with is if a folio is simply long-term R/O pinned.
But I am also not quite sure how mlock might interfere here.

So yes, I think the page cache is likely one of the most problematic/messy
things to handle.

--
Cheers,

David

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 15:30 ` David Hildenbrand (Arm) @ 2026-02-17 15:51 ` Ryan Roberts 2026-02-20 4:49 ` Matthew Wilcox 1 sibling, 0 replies; 21+ messages in thread From: Ryan Roberts @ 2026-02-17 15:51 UTC (permalink / raw) To: David Hildenbrand (Arm), Matthew Wilcox, Dev Jain Cc: lsf-pc, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 17/02/2026 15:30, David Hildenbrand (Arm) wrote: > On 2/17/26 16:22, Matthew Wilcox wrote: >> On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote: >>> 2. Generic Linux MM enlightenment >>> --------------------------------- >>> We enlighten the Linux MM code to always hand out memory in the granularity >> >> Please don't use the term "enlighten". Tht's used to describe something >> something or other with hypervisors. Come up with a new term or use one >> that already exists. >> >>> File memory >>> ----------- >>> For a growing list of compliant file systems, large folios can already be >>> stored in the page cache. There is even a mechanism, introduced to support >>> filesystems with block sizes larger than the system page size, to set a >>> hard-minimum size for folios on a per-address-space basis. This mechanism >>> will be reused and extended to service the per-process page size requirements. >>> >>> One key reason that the 64K kernel currently consumes considerably more memory >>> than the 4K kernel is that Linux systems often have lots of small >>> configuration files which each require a page in the page cache. But these >>> small files are (likely) only used by certain processes. So, we prefer to >>> continue to cache those using a 4K page. >>> Therefore, if a process with a larger page size maps a file whose pagecache >>> contains smaller folios, we drop them and re-read the range with a folio >>> order at least that of the process order. >> >> That's going to be messy. 
I don't have a good idea for solving this
>> problem, but the page cache really isn't set up to change minimum folio
>> order while the inode is in use.

Dev has a prototype up and running, but based on your comments, I'm
guessing there is some horrible race that hasn't hit yet. Would be good
to debug the gap in understanding at some point!

> In a private conversation I also raised that some situations might make it
> impossible/hard to drop+re-read.
>
> One example I came up with is if a folio is simply long-term R/O pinned.
> But I am also not quite sure how mlock might interfere here.
>
> So yes, I think the page cache is likely one of the most problematic/messy
> things to handle.

I guess we could sidestep the problem for now, by initially requiring
that the minimum folio size always be the maximum supported process page
size. That would allow us to get something up and running at least. But
then we lose the memory saving benefits.

Of course, I'm conveniently ignoring that not all filesystems support
large folios, but perhaps we could do a generic fallback adapter with a
bounce buffer for that case?

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 15:30 ` David Hildenbrand (Arm) 2026-02-17 15:51 ` Ryan Roberts @ 2026-02-20 4:49 ` Matthew Wilcox 2026-02-20 16:50 ` David Hildenbrand (Arm) 1 sibling, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2026-02-20 4:49 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Dev Jain, lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On Tue, Feb 17, 2026 at 04:30:59PM +0100, David Hildenbrand (Arm) wrote: > In a private conversation I also raised that some situations might make it > impossible/hard to drop+re-read. > > One example I cam up with if a folio is simply long-term R/O pinned. But I > am also not quite sure how mlock might interfere here. > > So yes, I think the page cache is likely the one of the most > problematic/messy thing to handle. So what if we convert to max-supported-order the first time somebody calls mmap on a given file? Most files are never mmaped, so it won't affect them. And files that are mmaped are generally not written to. So there should not be much in the page cache for the common case. And if no pages from the file have been mmaped yet, they cannot be pinned or mlocked. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-20 4:49 ` Matthew Wilcox @ 2026-02-20 16:50 ` David Hildenbrand (Arm) 2026-02-23 13:02 ` [Lsf-pc] " Jan Kara 0 siblings, 1 reply; 21+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-20 16:50 UTC (permalink / raw) To: Matthew Wilcox Cc: Dev Jain, lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 2/20/26 05:49, Matthew Wilcox wrote: > On Tue, Feb 17, 2026 at 04:30:59PM +0100, David Hildenbrand (Arm) wrote: >> In a private conversation I also raised that some situations might make it >> impossible/hard to drop+re-read. >> >> One example I cam up with if a folio is simply long-term R/O pinned. But I >> am also not quite sure how mlock might interfere here. >> >> So yes, I think the page cache is likely the one of the most >> problematic/messy thing to handle. > > So what if we convert to max-supported-order the first time somebody > calls mmap on a given file? Most files are never mmaped, so it won't > affect them. Yes! > And files that are mmaped are generally not written to. Well, let's say many mmaped files are not written to. :) > So there should not be much in the page cache for the common case. You'd assume many files to either get mmaped or read/written, yes. > And if no pages from the file have been mmaped yet, they cannot be pinned > or mlocked. Is there some other way for someone to block a page from getting evicted from the pagecache? We have this memfd_pin_folios() thing, but I don't think we have something comparable for ordinary pagecache files. ... putting them into a pipe and never reading from the pipe maybe (I assume that's what splice() does, but not sure if it actually places the pages in there or whether it creates a copy first)? But yes, doing the conversion on first mmap() should handle mlock+gup. 
-- Cheers, David ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Per-process page size 2026-02-20 16:50 ` David Hildenbrand (Arm) @ 2026-02-23 13:02 ` Jan Kara 0 siblings, 0 replies; 21+ messages in thread From: Jan Kara @ 2026-02-23 13:02 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Matthew Wilcox, Dev Jain, lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On Fri 20-02-26 17:50:44, David Hildenbrand (Arm) via Lsf-pc wrote: > On 2/20/26 05:49, Matthew Wilcox wrote: > > And if no pages from the file have been mmaped yet, they cannot be pinned > > or mlocked. > > Is there some other way for someone to block a page from getting evicted > from the pagecache? > > We have this memfd_pin_folios() thing, but I don't think we have something > comparable for ordinary pagecache files. > > ... putting them into a pipe and never reading from the pipe maybe (I assume > that's what splice() does, but not sure if it actually places the pages in > there or whether it creates a copy first)? Standard splice copies data first (it's using standard IO callbacks such as ->read_iter) so that doesn't pin page cache AFAICT. Only vmsplice(2) does but that requires mmap. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 15:22 ` Matthew Wilcox 2026-02-17 15:30 ` David Hildenbrand (Arm) @ 2026-02-18 8:39 ` Dev Jain 2026-02-18 8:58 ` Dev Jain 2026-02-20 9:49 ` Arnd Bergmann 2 siblings, 1 reply; 21+ messages in thread From: Dev Jain @ 2026-02-18 8:39 UTC (permalink / raw) To: Matthew Wilcox Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 17/02/26 8:52 pm, Matthew Wilcox wrote: > On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote: >> 2. Generic Linux MM enlightenment >> --------------------------------- >> We enlighten the Linux MM code to always hand out memory in the granularity > Please don't use the term "enlighten". Tht's used to describe something > something or other with hypervisors. Come up with a new term or use one > that already exists. Sure. > >> File memory >> ----------- >> For a growing list of compliant file systems, large folios can already be >> stored in the page cache. There is even a mechanism, introduced to support >> filesystems with block sizes larger than the system page size, to set a >> hard-minimum size for folios on a per-address-space basis. This mechanism >> will be reused and extended to service the per-process page size requirements. >> >> One key reason that the 64K kernel currently consumes considerably more memory >> than the 4K kernel is that Linux systems often have lots of small >> configuration files which each require a page in the page cache. But these >> small files are (likely) only used by certain processes. So, we prefer to >> continue to cache those using a 4K page. >> Therefore, if a process with a larger page size maps a file whose pagecache >> contains smaller folios, we drop them and re-read the range with a folio >> order at least that of the process order. > That's going to be messy. 
I don't have a good idea for solving this
> problem, but the page cache really isn't set up to change minimum folio
> order while the inode is in use.

Holding mapping->invalidate_lock, bumping mapping->min_folio_order and
dropping-rereading the range suffers from a race - filemap_fault operating
on some other partially populated 64K range will observe in
filemap_get_folio that nothing is in the pagecache. Then, it will read the
updated min_order in __filemap_get_folio, then use filemap_add_folio to
add a 64K folio, but since the 64K range is partially populated, we get
stuck in an infinite loop due to -EEXIST.

So I figured that deleting the entire pagecache is simpler. We will also
bail out early in __filemap_add_folio if the folio order asked by the
caller to create is less than mapping_min_folio_order. Eventually the
caller is going to read the correct min order. This algorithm avoids the
race above, however...

my assumption here was that we are synchronized on mapping->invalidate_lock.
The kerneldoc above read_cache_folio() and some other comments convinced me
of that, but I just checked with a VM_WARN_ON(!is_rwsem_locked()) in
__filemap_add_folio and this doesn't seem to be the case for all code
paths... If the algorithm sounds reasonable, I wonder what is the correct
synchronization mechanism here.

>
>> - Are there other arches which could benefit from this?
> Some architectures walk the page tables entirely in software, but on the
> other hand, those tend to be, er, "legacy" architectures these days and
> it's doubtful that anybody would invest in adding support.
>
> Sounds like a good question for Arnd ;-)
>
>> - What level of compatibility we can achieve - is it even possible to
>> contain userspace within the emulated ABI?
>> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For
>> example, what happens when a 64K process opens a procfs file of
>> a 4K process?
>> - native pgtable implementation - perhaps inspiration can be taken >> from other arches with an involved pgtable logic (ppc, s390)? > I question who decides what page size a particular process will use. > The programmer? The sysadmin? It seems too disruptive for the kernel > to monitor and decide for the app what page size it will use. It's the sysadmin. The latter method you mention is similar to the problem of the kernel choosing the correct mTHP order, which we don't have an elegant idea for solving yet. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-18 8:39 ` Dev Jain @ 2026-02-18 8:58 ` Dev Jain 2026-02-18 9:15 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 21+ messages in thread From: Dev Jain @ 2026-02-18 8:58 UTC (permalink / raw) To: Matthew Wilcox Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 18/02/26 2:09 pm, Dev Jain wrote: > On 17/02/26 8:52 pm, Matthew Wilcox wrote: >> On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote: >>> 2. Generic Linux MM enlightenment >>> --------------------------------- >>> We enlighten the Linux MM code to always hand out memory in the granularity >> Please don't use the term "enlighten". Tht's used to describe something >> something or other with hypervisors. Come up with a new term or use one >> that already exists. > Sure. > >>> File memory >>> ----------- >>> For a growing list of compliant file systems, large folios can already be >>> stored in the page cache. There is even a mechanism, introduced to support >>> filesystems with block sizes larger than the system page size, to set a >>> hard-minimum size for folios on a per-address-space basis. This mechanism >>> will be reused and extended to service the per-process page size requirements. >>> >>> One key reason that the 64K kernel currently consumes considerably more memory >>> than the 4K kernel is that Linux systems often have lots of small >>> configuration files which each require a page in the page cache. But these >>> small files are (likely) only used by certain processes. So, we prefer to >>> continue to cache those using a 4K page. >>> Therefore, if a process with a larger page size maps a file whose pagecache >>> contains smaller folios, we drop them and re-read the range with a folio >>> order at least that of the process order. >> That's going to be messy. 
I don't have a good idea for solving this >> problem, but the page cache really isn't set up to change minimum folio >> order while the inode is in use. > Holding mapping->invalidate_lock, bumping mapping->min_folio_order and > dropping-rereading the range suffers from a race - filemap_fault operating > on some other partially populated 64K range will observe in filemap_get_folio > that nothing is in the pagecache. Then, it will read the updated min_order > in __filemap_get_folio, then use filemap_add_folio to add a 64K folio, but since > the 64K range is partially populated, we get stuck in an infinite loop due to -EEXIST. > > So I figured that deleting the entire pagecache is simpler. We will also bail > out early in __filemap_add_folio if the folio order asked by the caller to > create is less than mapping_min_folio_order. Eventually the caller is going > to read the correct min order. This algorithm avoids the race above, however... > > my assumption here was that we are synchronized on mapping->invalidate_lock. > The kerneldoc above read_cache_folio() and some other comments convinced me > of that, but I just checked with a VM_WARN_ON(!is_rwsem_locked()) in > __filemap_add_folio and this doesn't seem to be the case for all code paths... > If the algorithm sounds reasonable, I wonder what is the correct synchronization > mechanism here. I may have been vague here... to avoid the race I described above, we must ensure that after all folios have been dropped from pagecache, and min order is bumped up, no other code path remembers the old order and partially populates a 64K range. For this we need synchronization. > >>> - Are there other arches which could benefit from this? >> Some architectures walk the page tables entirely in software, but on the >> other hand, those tend to be, er, "legacy" architectures these days and >> it's doubtful that anybody would invest in adding support. 
>> >> Sounds like a good question for Arnd ;-) >> >>> - What level of compatibility we can achieve - is it even possible to >>> contain userspace within the emulated ABI? >>> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For >>> example, what happens when a 64K process opens a procfs file of >>> a 4K process? >>> - native pgtable implementation - perhaps inspiration can be taken >>> from other arches with an involved pgtable logic (ppc, s390)? >> I question who decides what page size a particular process will use. >> The programmer? The sysadmin? It seems too disruptive for the kernel >> to monitor and decide for the app what page size it will use. > It's the sysadmin. The latter method you mention is similar to the problem > of the kernel choosing the correct mTHP order, which we don't have an > elegant idea for solving yet. > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-18 8:58 ` Dev Jain @ 2026-02-18 9:15 ` David Hildenbrand (Arm) 0 siblings, 0 replies; 21+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-18 9:15 UTC (permalink / raw) To: Dev Jain, Matthew Wilcox Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 2/18/26 09:58, Dev Jain wrote: > > On 18/02/26 2:09 pm, Dev Jain wrote: >> On 17/02/26 8:52 pm, Matthew Wilcox wrote: >>> Please don't use the term "enlighten". Tht's used to describe something >>> something or other with hypervisors. Come up with a new term or use one >>> that already exists. >> Sure. >> >>> That's going to be messy. I don't have a good idea for solving this >>> problem, but the page cache really isn't set up to change minimum folio >>> order while the inode is in use. >> Holding mapping->invalidate_lock, bumping mapping->min_folio_order and >> dropping-rereading the range suffers from a race - filemap_fault operating >> on some other partially populated 64K range will observe in filemap_get_folio >> that nothing is in the pagecache. Then, it will read the updated min_order >> in __filemap_get_folio, then use filemap_add_folio to add a 64K folio, but since >> the 64K range is partially populated, we get stuck in an infinite loop due to -EEXIST. >> >> So I figured that deleting the entire pagecache is simpler. We will also bail >> out early in __filemap_add_folio if the folio order asked by the caller to >> create is less than mapping_min_folio_order. Eventually the caller is going >> to read the correct min order. This algorithm avoids the race above, however... >> >> my assumption here was that we are synchronized on mapping->invalidate_lock. 
>> The kerneldoc above read_cache_folio() and some other comments convinced me >> of that, but I just checked with a VM_WARN_ON(!is_rwsem_locked()) in >> __filemap_add_folio and this doesn't seem to be the case for all code paths... >> If the algorithm sounds reasonable, I wonder what is the correct synchronization >> mechanism here. > > I may have been vague here... to avoid the race I described above, we must > ensure that after all folios have been dropped from pagecache, and min order > is bumped up, no other code path remembers the old order and partially > populates a 64K range. For this we need synchronization. And I don't think you can reliably do that when other processes might be using the files concurrently. It's best to start like Ryan suggested: lifting min_order on these systems for now and leaving dynamically switching the min order as future work. -- Cheers, David ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 15:22 ` Matthew Wilcox 2026-02-17 15:30 ` David Hildenbrand (Arm) 2026-02-18 8:39 ` Dev Jain @ 2026-02-20 9:49 ` Arnd Bergmann 2 siblings, 0 replies; 21+ messages in thread From: Arnd Bergmann @ 2026-02-20 9:49 UTC (permalink / raw) To: Matthew Wilcox, Dev Jain Cc: lsf-pc, Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel, Hugh Dickins, Baolin Wang, Andrew Morton, David Hildenbrand (Red Hat), Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-arm-kernel, linux-kernel On Tue, Feb 17, 2026, at 16:22, Matthew Wilcox wrote: > On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote: >> >> - Are there other arches which could benefit from this? > > Some architectures walk the page tables entirely in software, but on the > other hand, those tend to be, er, "legacy" architectures these days and > it's doubtful that anybody would invest in adding support. > > Sounds like a good question for Arnd ;-) I think Loongarch and RISC-V are the candidates for doing whatever Arm does here. MIPS and PowerPC64 could do it in theory, but it's less clear that someone will spend the effort here. >> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For >> example, what happens when a 64K process opens a procfs file of >> a 4K process? This would also be my main concern. There are hundreds of device drivers that implement a custom .mmap() file operation, and a few dozen file systems, all of which need to be audited and likely changed to allow mapping larger granules. >> - native pgtable implementation - perhaps inspiration can be taken >> from other arches with an involved pgtable logic (ppc, s390)? > > I question who decides what page size a particular process will use. > The programmer? The sysadmin? 
I would expect this to be done by a combination of these two, it seems simple enough to have a wrapper like numactl or setarch to start an application one way or another. Another concern I have is for the actual performance trade-offs here. As I understand it, the idea is to have most of the memory size advantages of a 4KB page kernel, and most of the performance advantages of a 64KB page kernel for the special applications that care about this. However, the same is true for 16KB page kernel, which also aims for the same trade-off with a much simpler model and a different set of compatibility problems. Do we expect per-process page size kernels to actually be better than fixed 16KB page kernels, and better enough that it's worth the added complexity? In particular, this approach would likely only get the advantages of the TLB but not the file systems using larger pages, while also suffering from the extra overhead of compacting smaller pages in order to map them. Arnd ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 14:50 [LSF/MM/BPF TOPIC] Per-process page size Dev Jain 2026-02-17 15:22 ` Matthew Wilcox @ 2026-02-20 13:37 ` Pedro Falcato 2026-02-23 5:07 ` Dev Jain 2026-02-26 7:40 ` Kalesh Singh 2 siblings, 1 reply; 21+ messages in thread From: Pedro Falcato @ 2026-02-20 13:37 UTC (permalink / raw) To: Dev Jain Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote: > Hi everyone, > > We propose per-process page size on arm64. Although the proposal is for > arm64, perhaps the concept can be extended to other arches, thus the > generic topic name. > > ------------- > INTRODUCTION > ------------- > While mTHP has brought the performance of many workloads running on an arm64 4K > kernel closer to that of the performance on an arm64 64K kernel, a performance > gap still remains. This is attributed to a combination of a greater number of > pgtable levels, less reach within the walk cache and higher data cache footprint > for pgtable memory. At the same time, 64K is not suitable for general > purpose environments due to its significantly higher memory footprint. Could this perhaps be because of larger-page-size kernels being able to use mTHP (and THP) more aggressively? It would be interesting to compare arm64 "4K" vs "4K with mTHP" vs "4K with _only_ mTHP" vs "64K" vs "64K with mTHP". > > To solve this, we have been experimenting with a concept called "per-process > page size". This breaks the historic assumption of a single page size for the > entire system: a process will now operate on a page size ABI that is greater > than or equal to the kernel's page size. This is enabled by a key architectural > feature on Arm: the separation of user and kernel page tables. 
> > This can also lead to a future of a single kernel image instead of 4K, 16K > and 64K images. > > -------------- > CURRENT DESIGN > -------------- > The design is based on one core idea; most of the kernel continues to believe > there is only one page size in use across the whole system. That page size is > the size selected at compile-time, as is done today. But every process (more > accurately mm_struct) has a page size ABI which is one of the 3 page sizes > (4K, 16K or 64K) as long as that page size is greater than or equal to the > kernel page size (kernel page size is the macro PAGE_SIZE). > > Pagesize selection > ------------------ > A process' selected page size ABI comes into force at execve() time and > remains fixed until the process exits or until the next execve(). Any forked > processes inherit the page size of their parent. > The personality() mechanism already exists for similar cases, so we propose > to extend it to enable specifying the required page size. > > There are 3 layers to the design. The first two are not arch-dependent, > and makes Linux support a per-process pagesize ABI. The last layer is > arch-specific. > > 1. ABI adapter > -------------- > A translation layer is added at the syscall boundary to convert between the > process page size and the kernel page size. This effectively means enforcing > alignment requirements for addresses passed to syscalls and ensuring that > quantities passed as “number of pages” are interpreted relative to the process > page size and not the kernel page size. In this way the process has the illusion > that it is working in units of its page size, but the kernel is working in > units of the kernel page size. > > 2. Generic Linux MM enlightenment > --------------------------------- > We enlighten the Linux MM code to always hand out memory in the granularity > of process pages. 
Most of this work is greatly simplified because of the > existing mTHP allocation paths, and the ongoing support for large folios > across different areas of the kernel. The process order will be used as the > hard minimum mTHP order to allocate. > > File memory > ----------- > For a growing list of compliant file systems, large folios can already be > stored in the page cache. There is even a mechanism, introduced to support > filesystems with block sizes larger than the system page size, to set a > hard-minimum size for folios on a per-address-space basis. This mechanism > will be reused and extended to service the per-process page size requirements. > > One key reason that the 64K kernel currently consumes considerably more memory > than the 4K kernel is that Linux systems often have lots of small > configuration files which each require a page in the page cache. But these > small files are (likely) only used by certain processes. So, we prefer to > continue to cache those using a 4K page. > Therefore, if a process with a larger page size maps a file whose pagecache > contains smaller folios, we drop them and re-read the range with a folio > order at least that of the process order. > > 3. Translation from Linux pagetable to native pagetable > ------------------------------------------------------- > Assume the case of a kernel pagesize of 4K and app pagesize of 64K. > Now that enlightenment is done, it is guaranteed that every single mapping > in the 4K pagetable (which we call the Linux pagetable) is of granularity > at least 64K. In the arm64 MM code, we maintain a "native" pagetable per > mm_struct, which is based off a 64K geometry. Because of the guarantee > aforementioned, any pagetable operation on the Linux pagetable > (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen > at a granularity of at least 16 PTEs - therefore we can translate this > operation to modify a single PTE entry in the native pagetable. 
> Given that enlightenment may miss corner cases, we insert a warning in the > architecture code - on being presented with an operation not translatable > into a native operation, we fallback to the Linux pagetable, thus losing > the benefits borne out of the pagetable geometry but keeping > the emulation intact. I don't understand. What exactly are you trying to do here? Maintain 2 different paging structures, one for core mm and the other for the arch? As done in architectures with no radix tree paging structures? If so, that's wildly inefficient, unless you're willing to go into reclaimable page tables on the arm64 side. And that brings extra problems and extra fun :) -- Pedro ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-20 13:37 ` Pedro Falcato @ 2026-02-23 5:07 ` Dev Jain 2026-02-23 12:49 ` Pedro Falcato 2026-02-23 15:18 ` Matthew Wilcox 0 siblings, 2 replies; 21+ messages in thread From: Dev Jain @ 2026-02-23 5:07 UTC (permalink / raw) To: Pedro Falcato Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel >> >> 3. Translation from Linux pagetable to native pagetable >> ------------------------------------------------------- >> Assume the case of a kernel pagesize of 4K and app pagesize of 64K. >> Now that enlightenment is done, it is guaranteed that every single mapping >> in the 4K pagetable (which we call the Linux pagetable) is of granularity >> at least 64K. In the arm64 MM code, we maintain a "native" pagetable per >> mm_struct, which is based off a 64K geometry. Because of the guarantee >> aforementioned, any pagetable operation on the Linux pagetable >> (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen >> at a granularity of at least 16 PTEs - therefore we can translate this >> operation to modify a single PTE entry in the native pagetable. >> Given that enlightenment may miss corner cases, we insert a warning in the >> architecture code - on being presented with an operation not translatable >> into a native operation, we fallback to the Linux pagetable, thus losing >> the benefits borne out of the pagetable geometry but keeping >> the emulation intact. > I don't understand. What exactly are you trying to do here? Maintain 2 > different paging structures, one for core mm and the other for the arch? As > done in architectures with no radix tree paging structures? The mm->pgd will be the software pagetable. So suppose that do_anonymous_page is doing set_ptes on the PTE table belonging to the software pagetable. 
We will hook a "native_set_ptes" into set_ptes, which will set the ptes on a different pagetable maintained by arm64 code (probably mm_context_t->native_pgd). > > If so, that's wildly inefficient, unless you're willing to go into reclaimable > page tables on the arm64 side. And that brings extra problems and extra fun :) I didn't understand the reclaimable reference, but yes we need to make this efficient. So for the above example I gave, native_set_ptes knows the virtual address to set - walking the native hierarchy from native_pgd->native_pmd->native_pte (in case of 64K native geometry) is inefficient. So we need to maintain a lookup mechanism from a linux pgtable pointer to the native pgtable pointer. The idea we have currently is to store such lookup in the struct ptdesc of the pagetable page. For 4K Linux pagetable and 64K native pagetable, 512M/2M = 256 Linux PTE tables correspond to different sections of the native PTE table. We will maintain the pointer to the relevant section in the native PTE table, in the struct ptdesc of the pagetable page of the Linux PTE table. The other case is that a single Linux pgtable leaf entry corresponds to multiple native leaf entries - take the case of a Linux PMD table which maps 1G of memory, this corresponds to 2 native PTE tables (2 x 512M). We will have to store a list of pointers here. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-23 5:07 ` Dev Jain @ 2026-02-23 12:49 ` Pedro Falcato 2026-02-23 13:01 ` David Hildenbrand (Arm) 2026-02-23 15:18 ` Matthew Wilcox 1 sibling, 1 reply; 21+ messages in thread From: Pedro Falcato @ 2026-02-23 12:49 UTC (permalink / raw) To: Dev Jain Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On Mon, Feb 23, 2026 at 10:37:55AM +0530, Dev Jain wrote: > >> > >> 3. Translation from Linux pagetable to native pagetable > >> ------------------------------------------------------- > >> Assume the case of a kernel pagesize of 4K and app pagesize of 64K. > >> Now that enlightenment is done, it is guaranteed that every single mapping > >> in the 4K pagetable (which we call the Linux pagetable) is of granularity > >> at least 64K. In the arm64 MM code, we maintain a "native" pagetable per > >> mm_struct, which is based off a 64K geometry. Because of the guarantee > >> aforementioned, any pagetable operation on the Linux pagetable > >> (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen > >> at a granularity of at least 16 PTEs - therefore we can translate this > >> operation to modify a single PTE entry in the native pagetable. > >> Given that enlightenment may miss corner cases, we insert a warning in the > >> architecture code - on being presented with an operation not translatable > >> into a native operation, we fallback to the Linux pagetable, thus losing > >> the benefits borne out of the pagetable geometry but keeping > >> the emulation intact. > > I don't understand. What exactly are you trying to do here? Maintain 2 > > different paging structures, one for core mm and the other for the arch? As > > done in architectures with no radix tree paging structures? > > The mm->pgd will be the software pagetable. 
So suppose that do_anonymous_page is > doing set_ptes on the PTE table belonging to the software pagetable. We will > hook a "native_set_ptes" into set_ptes, which will set the ptes on a different > pagetable maintained by arm64 code (probably mm_context_t->native_pgd). Traditionally, you do this kind of funky manipulation in update_mmu_cache. But this is still an extremely complex and invasive change (that I assume most people would not like to see) with dubious benefit. > > > > > If so, that's wildly inefficient, unless you're willing to go into reclaimable > > page tables on the arm64 side. And that brings extra problems and extra fun :) > > I didn't understand the reclaimable reference, but yes we need to make this efficient. I'm not talking about CPU runtime efficiency, but memory efficiency. Doing this makes you essentially duplicate page tables - not exactly ideal. This is a Known Problem in classic UNIX systems which do something similar (but not the same): anonymous memory pointers are stored in some intermediary structure (SunOS and UVM call it "amap"), and paging structures are entirely redundant there. They can freely tear down a page table because they can freely put it together from the amap and file mappings (what they call vm_object and we call address_space). Anyway, I'm boring you with these funny historical details so you can understand the similarities: the Linux page table format generally matches hardware, and we store anonymous memory "state" there, so you can't ever tear-down a pgtable without losing state of whatever was mapped there before. However, if you go down the "arm64 now has a separate pgtable structure", the roles switch: arm64's internal page table format makes for the real page tables, and linux's pgtable structure is nothing more than an "amap". So you could (and perhaps should) freely reclaim arm64 MMU page tables once memory pressure hits, because they are freely discardable. Does this make sense? 
-- Pedro ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-23 12:49 ` Pedro Falcato @ 2026-02-23 13:01 ` David Hildenbrand (Arm) 0 siblings, 0 replies; 21+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-23 13:01 UTC (permalink / raw) To: Pedro Falcato, Dev Jain Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On 2/23/26 13:49, Pedro Falcato wrote: > On Mon, Feb 23, 2026 at 10:37:55AM +0530, Dev Jain wrote: >>> I don't understand. What exactly are you trying to do here? Maintain 2 >>> different paging structures, one for core mm and the other for the arch? As >>> done in architectures with no radix tree paging structures? >> >> The mm->pgd will be the software pagetable. So suppose that do_anonymous_page is >> doing set_ptes on the PTE table belonging to the software pagetable. We will >> hook a "native_set_ptes" into set_ptes, which will set the ptes on a different >> pagetable maintained by arm64 code (probably mm_context_t->native_pgd). > > Traditionally, you do this kind of funky manipulation in update_mmu_cache. > > But this is still an extremely complex and invasive change (that I assume most > people would not like to see) with dubious benefit. > >> >>> >>> If so, that's wildly inefficient, unless you're willing to go into reclaimable >>> page tables on the arm64 side. And that brings extra problems and extra fun :) >> >> I didn't understand the reclaimable reference, but yes we need to make this efficient. > > I'm not talking about CPU runtime efficiency, but memory efficiency. Doing > this makes you essentially duplicate page tables - not exactly ideal. This is > a Known Problem in classic UNIX systems which do something similar > (but not the same): anonymous memory pointers are stored in some intermediary > structure (SunOS and UVM call it "amap"), and paging structures are entirely > redundant there. 
They can freely tear down a page table because they can freely > put it together from the amap and file mappings (what they call vm_object and > we call address_space). > > Anyway, I'm boring you with these funny historical details so you can understand > the similarities: the Linux page table format generally matches hardware, and > we store anonymous memory "state" there, so you can't ever tear-down a pgtable > without losing state of whatever was mapped there before. However, if you go > down the "arm64 now has a separate pgtable structure", the roles switch: > arm64's internal page table format makes for the real page tables, and linux's > pgtable structure is nothing more than an "amap". So you could (and perhaps > should) freely reclaim arm64 MMU page tables once memory pressure hits, because > they are freely discardable. > > Does this make sense? I've been thinking about building the 64k page tables similar to how HMM/KVM handles it, invalidating them through mmu notifiers etc and building them on demand. Considering the 64k MMU of a process just like a special device that builds its own page tables. This way, they could get reclaimed more easily and most of the core + arm64 page able manipulation code could be kept as is. However, I don't know how much the performance impact of that approach would be. -- Cheers, David ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-23 5:07 ` Dev Jain 2026-02-23 12:49 ` Pedro Falcato @ 2026-02-23 15:18 ` Matthew Wilcox 2026-02-23 16:28 ` David Hildenbrand (Arm) 1 sibling, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2026-02-23 15:18 UTC (permalink / raw) To: Dev Jain Cc: Pedro Falcato, lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel On Mon, Feb 23, 2026 at 10:37:55AM +0530, Dev Jain wrote: > I didn't understand the reclaimable reference, but yes we need to make this efficient. this goes over 80 columns so much and so often, it's painful to read. so i didn't. > So for the above example I gave, native_set_ptes knows the virtual address to set - > walking the native hierarchy from native_pgd->native_pmd->native_pte (in case of 64K native > geometry) is inefficient. So we need to maintain a lookup mechanism from a linux pgtable > pointer to the native pgtable pointer. > The idea we have currently is to store such lookup in the struct ptdesc of the pagetable page. > For 4K Linux pagetable and 64K native pagetable, 512M/2M = 256 Linux PTE tables correspond > to different sections of the native PTE table. We will maintain the pointer to the relevant > section in the native PTE table, in the struct ptdesc of the pagetable page of the Linux > PTE table. > The other case is that a single Linux pgtable leaf entry corresponds to multiple native > leaf entries - take the case of a Linux PMD table which maps 1G of memory, this corresponds > to 2 native PTE tables (2 x 512M). We will have to store a list of pointers here. > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-23 15:18 ` Matthew Wilcox @ 2026-02-23 16:28 ` David Hildenbrand (Arm) 2026-02-24 4:32 ` Dev Jain 0 siblings, 1 reply; 21+ messages in thread From: David Hildenbrand (Arm) @ 2026-02-23 16:28 UTC (permalink / raw) To: Matthew Wilcox, Dev Jain Cc: Pedro Falcato, lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel, Jens Axboe On 2/23/26 16:18, Matthew Wilcox wrote: > On Mon, Feb 23, 2026 at 10:37:55AM +0530, Dev Jain wrote: >> I didn't understand the reclaimable reference, but yes we need to make this efficient. > > this goes over 80 columns so much and so often, it's painful to read. > so i didn't. I just found out that Thunderbird was lying to me the whole time. If you're using "Toggle Line Wrap" plugin you might think that mails are properly wrapped, you know, like *they are displayed*. And even lore displays them properly. But in the back, Thunderbird set "format=flowed" and screws you. So, if anyone else believes that they are sending properly wrapped mails with Thunderbird, read Documentation/process/email-clients.rst and make sure that "mailnews.send_plaintext_flowed" is set to false. Thanks Jens for the pointer! -- Cheers, David ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-23 16:28 ` David Hildenbrand (Arm) @ 2026-02-24 4:32 ` Dev Jain 0 siblings, 0 replies; 21+ messages in thread From: Dev Jain @ 2026-02-24 4:32 UTC (permalink / raw) To: David Hildenbrand (Arm), Matthew Wilcox Cc: Pedro Falcato, lsf-pc, ryan.roberts, catalin.marinas, will, ardb, hughd, baolin.wang, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel, Jens Axboe On 23/02/26 9:58 pm, David Hildenbrand (Arm) wrote: > On 2/23/26 16:18, Matthew Wilcox wrote: >> On Mon, Feb 23, 2026 at 10:37:55AM +0530, Dev Jain wrote: >>> I didn't understand the reclaimable reference, but yes we need to make this efficient. >> >> this goes over 80 columns so much and so often, it's painful to read. >> so i didn't. > > I just found out that Thunderbird was lying to me the whole time. > > If you're using "Toggle Line Wrap" plugin you might think that mails are > properly wrapped, you know, like *they are displayed*. And even lore > displays them properly. > > But in the back, Thunderbird set "format=flowed" and screws you. > > So, if anyone else believes that they are sending properly wrapped mails > with Thunderbird, read > > Documentation/process/email-clients.rst > > and make sure that "mailnews.send_plaintext_flowed" is set to false. > > > Thanks Jens for the pointer! > Thanks for letting me know about Toggle Line Wrap. This works, along with mailnews.wraplength. If I set this to 0, which is what email-clients.rst suggests, it doesn't work. Thunderbird is confusing. ^ permalink raw reply [flat|nested] 21+ messages in thread
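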
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-17 14:50 [LSF/MM/BPF TOPIC] Per-process page size Dev Jain 2026-02-17 15:22 ` Matthew Wilcox 2026-02-20 13:37 ` Pedro Falcato @ 2026-02-26 7:40 ` Kalesh Singh 2026-02-26 8:45 ` Dev Jain 2 siblings, 1 reply; 21+ messages in thread From: Kalesh Singh @ 2026-02-26 7:40 UTC (permalink / raw) To: Dev Jain Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel, Mateusz Maćkowski, Adrian Barnaś, Marcin Szymczyk On Tue, Feb 17, 2026 at 6:50 AM Dev Jain <dev.jain@arm.com> wrote: > > Hi everyone, > > We propose per-process page size on arm64. Although the proposal is for > arm64, perhaps the concept can be extended to other arches, thus the > generic topic name. > > ------------- > INTRODUCTION > ------------- > While mTHP has brought the performance of many workloads running on an arm64 4K > kernel closer to that of the performance on an arm64 64K kernel, a performance > gap still remains. This is attributed to a combination of greater number of > pgtable levels, less reach within the walk cache and higher data cache footprint > for pgtable memory. At the same time, 64K is not suitable for general > purpose environments due to it's significantly higher memory footprint. > > To solve this, we have been experimenting with a concept called "per-process > page size". This breaks the historic assumption of a single page size for the > entire system: a process will now operate on a page size ABI that is greater > than or equal to the kernel's page size. This is enabled by a key architectural > feature on Arm: the separation of user and kernel page tables. > > This can also lead to a future of a single kernel image instead of 4K, 16K > and 64K images. 
> > -------------- > CURRENT DESIGN > -------------- > The design is based on one core idea; most of the kernel continues to believe > there is only one page size in use across the whole system. That page size is > the size selected at compile-time, as is done today. But every process (more > accurately mm_struct) has a page size ABI which is one of the 3 page sizes > (4K, 16K or 64K) as long as that page size is greater than or equal to the > kernel page size (kernel page size is the macro PAGE_SIZE). > > Pagesize selection > ------------------ > A process' selected page size ABI comes into force at execve() time and > remains fixed until the process exits or until the next execve(). Any forked > processes inherit the page size of their parent. > The personality() mechanism already exists for similar cases, so we propose > to extend it to enable specifying the required page size. > > There are 3 layers to the design. The first two are not arch-dependent, > and makes Linux support a per-process pagesize ABI. The last layer is > arch-specific. > > 1. ABI adapter > -------------- > A translation layer is added at the syscall boundary to convert between the > process page size and the kernel page size. This effectively means enforcing > alignment requirements for addresses passed to syscalls and ensuring that > quantities passed as “number of pages” are interpreted relative to the process > page size and not the kernel page size. In this way the process has the illusion > that it is working in units of its page size, but the kernel is working in > units of the kernel page size. > > 2. Generic Linux MM enlightenment > --------------------------------- > We enlighten the Linux MM code to always hand out memory in the granularity > of process pages. Most of this work is greatly simplified because of the > existing mTHP allocation paths, and the ongoing support for large folios > across different areas of the kernel. 
The process order will be used as the > hard minimum mTHP order to allocate. > > File memory > ----------- > For a growing list of compliant file systems, large folios can already be > stored in the page cache. There is even a mechanism, introduced to support > filesystems with block sizes larger than the system page size, to set a > hard-minimum size for folios on a per-address-space basis. This mechanism > will be reused and extended to service the per-process page size requirements. > > One key reason that the 64K kernel currently consumes considerably more memory > than the 4K kernel is that Linux systems often have lots of small > configuration files which each require a page in the page cache. But these > small files are (likely) only used by certain processes. So, we prefer to > continue to cache those using a 4K page. > Therefore, if a process with a larger page size maps a file whose pagecache > contains smaller folios, we drop them and re-read the range with a folio > order at least that of the process order. > > 3. Translation from Linux pagetable to native pagetable > ------------------------------------------------------- > Assume the case of a kernel pagesize of 4K and app pagesize of 64K. > Now that enlightenment is done, it is guaranteed that every single mapping > in the 4K pagetable (which we call the Linux pagetable) is of granularity > at least 64K. In the arm64 MM code, we maintain a "native" pagetable per > mm_struct, which is based off a 64K geometry. Because of the guarantee > aforementioned, any pagetable operation on the Linux pagetable > (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen > at a granularity of at least 16 PTEs - therefore we can translate this > operation to modify a single PTE entry in the native pagetable. 
> Given that enlightenment may miss corner cases, we insert a warning in the > architecture code - on being presented with an operation not translatable > into a native operation, we fallback to the Linux pagetable, thus losing > the benefits borne out of the pagetable geometry but keeping > the emulation intact. > > ----------------------- > What we want to discuss > ----------------------- > - Are there other arches which could benefit from this? > - What level of compatibility we can achieve - is it even possible to > contain userspace within the emulated ABI? > - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For > example, what happens when a 64K process opens a procfs file of > a 4K process? > - native pgtable implementation - perhaps inspiration can be taken > from other arches with an involved pgtable logic (ppc, s390)? > Hi Dev, Ryan, I'd be very interested in joining this discussion at LSF/MM. On Android, we have a separate but very related use case: we emulate a larger userspace page size on x86, primarily to allow app developers to test their apps for 16KB compatibility using x86 emulators [1]. Similar to your proposed "ABI adapter" layer, our approach works by enforcing a larger 16KB granularity and alignment on the VMAs to emulate the userspace page size, while the underlying kernel still operates on a 4KB granularity [2]. In our emulation experience, we've run into a few specific rough edges: 1. mmap and SIGBUS: Enforcing a larger VMA granularity means that mapping files can easily extend the VMA beyond the end of the file's valid offset. When userspace touches this padded area, the 4KB filemap fault cannot resolve to a valid index, resulting in a SIGBUS that applications aren't expecting. 2. userfaultfd: This inherently operates at the strict PTE granularity of the underlying kernel (4KB). Hiding this from a userspace that expects a 16KB/64KB fault granularity while the kernel still operates on 4KB granularity is messy ... 3. 
pagemap and PFN interfaces: As you noted with procfs, interfaces that expose or consume PFNs are problematic. Userspace tools reading /proc/pid/pagemap, /proc/kpagecount, /proc/kpageflags, /proc/kpagecgroup, and /sys/kernel/mm/page_idle/bitmap calculate offsets based on the userspace page size ABI, but the kernel returns 4KB PFNs which breaks such users. It would be great to explore if we can align on a unified approach to solve these. [1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator [2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic Thanks, Kalesh > ------------- > Key Attendees > ------------- > - Ryan Roberts (co-presenter) > - mm folks (David Hildenbrand, Matthew Wilcox, Liam Howlett, Lorenzo Stoakes, > and many others) > - arch folks > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-26 7:40 ` Kalesh Singh @ 2026-02-26 8:45 ` Dev Jain 2026-02-27 5:11 ` Kalesh Singh 0 siblings, 1 reply; 21+ messages in thread From: Dev Jain @ 2026-02-26 8:45 UTC (permalink / raw) To: Kalesh Singh Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel, Mateusz Maćkowski, Adrian Barnaś, Marcin Szymczyk On 26/02/26 1:10 pm, Kalesh Singh wrote: > On Tue, Feb 17, 2026 at 6:50 AM Dev Jain <dev.jain@arm.com> wrote: >> >> Hi everyone, >> >> We propose per-process page size on arm64. Although the proposal is for >> arm64, perhaps the concept can be extended to other arches, thus the >> generic topic name. >> >> ------------- >> INTRODUCTION >> ------------- >> While mTHP has brought the performance of many workloads running on an arm64 4K >> kernel closer to that of the performance on an arm64 64K kernel, a performance >> gap still remains. This is attributed to a combination of greater number of >> pgtable levels, less reach within the walk cache and higher data cache footprint >> for pgtable memory. At the same time, 64K is not suitable for general >> purpose environments due to it's significantly higher memory footprint. >> >> To solve this, we have been experimenting with a concept called "per-process >> page size". This breaks the historic assumption of a single page size for the >> entire system: a process will now operate on a page size ABI that is greater >> than or equal to the kernel's page size. This is enabled by a key architectural >> feature on Arm: the separation of user and kernel page tables. >> >> This can also lead to a future of a single kernel image instead of 4K, 16K >> and 64K images. 
>> >> -------------- >> CURRENT DESIGN >> -------------- >> The design is based on one core idea; most of the kernel continues to believe >> there is only one page size in use across the whole system. That page size is >> the size selected at compile-time, as is done today. But every process (more >> accurately mm_struct) has a page size ABI which is one of the 3 page sizes >> (4K, 16K or 64K) as long as that page size is greater than or equal to the >> kernel page size (kernel page size is the macro PAGE_SIZE). >> >> Pagesize selection >> ------------------ >> A process' selected page size ABI comes into force at execve() time and >> remains fixed until the process exits or until the next execve(). Any forked >> processes inherit the page size of their parent. >> The personality() mechanism already exists for similar cases, so we propose >> to extend it to enable specifying the required page size. >> >> There are 3 layers to the design. The first two are not arch-dependent, >> and makes Linux support a per-process pagesize ABI. The last layer is >> arch-specific. >> >> 1. ABI adapter >> -------------- >> A translation layer is added at the syscall boundary to convert between the >> process page size and the kernel page size. This effectively means enforcing >> alignment requirements for addresses passed to syscalls and ensuring that >> quantities passed as “number of pages” are interpreted relative to the process >> page size and not the kernel page size. In this way the process has the illusion >> that it is working in units of its page size, but the kernel is working in >> units of the kernel page size. >> >> 2. Generic Linux MM enlightenment >> --------------------------------- >> We enlighten the Linux MM code to always hand out memory in the granularity >> of process pages. Most of this work is greatly simplified because of the >> existing mTHP allocation paths, and the ongoing support for large folios >> across different areas of the kernel. 
The process order will be used as the >> hard minimum mTHP order to allocate. >> >> File memory >> ----------- >> For a growing list of compliant file systems, large folios can already be >> stored in the page cache. There is even a mechanism, introduced to support >> filesystems with block sizes larger than the system page size, to set a >> hard-minimum size for folios on a per-address-space basis. This mechanism >> will be reused and extended to service the per-process page size requirements. >> >> One key reason that the 64K kernel currently consumes considerably more memory >> than the 4K kernel is that Linux systems often have lots of small >> configuration files which each require a page in the page cache. But these >> small files are (likely) only used by certain processes. So, we prefer to >> continue to cache those using a 4K page. >> Therefore, if a process with a larger page size maps a file whose pagecache >> contains smaller folios, we drop them and re-read the range with a folio >> order at least that of the process order. >> >> 3. Translation from Linux pagetable to native pagetable >> ------------------------------------------------------- >> Assume the case of a kernel pagesize of 4K and app pagesize of 64K. >> Now that enlightenment is done, it is guaranteed that every single mapping >> in the 4K pagetable (which we call the Linux pagetable) is of granularity >> at least 64K. In the arm64 MM code, we maintain a "native" pagetable per >> mm_struct, which is based off a 64K geometry. Because of the guarantee >> aforementioned, any pagetable operation on the Linux pagetable >> (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen >> at a granularity of at least 16 PTEs - therefore we can translate this >> operation to modify a single PTE entry in the native pagetable. 
>> Given that enlightenment may miss corner cases, we insert a warning in the >> architecture code - on being presented with an operation not translatable >> into a native operation, we fallback to the Linux pagetable, thus losing >> the benefits borne out of the pagetable geometry but keeping >> the emulation intact. >> >> ----------------------- >> What we want to discuss >> ----------------------- >> - Are there other arches which could benefit from this? >> - What level of compatibility we can achieve - is it even possible to >> contain userspace within the emulated ABI? >> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For >> example, what happens when a 64K process opens a procfs file of >> a 4K process? >> - native pgtable implementation - perhaps inspiration can be taken >> from other arches with an involved pgtable logic (ppc, s390)? >> > > Hi Dev, Ryan, > > I'd be very interested in joining this discussion at LSF/MM. Thanks Kalesh for your interest! > > On Android, we have a separate but very related use case: we emulate a > larger userspace page size on x86, primarily to allow app developers > to test their apps for 16KB compatibility using x86 emulators [1]. > > Similar to your proposed "ABI adapter" layer, our approach works by > enforcing a larger 16KB granularity and alignment on the VMAs to > emulate the userspace page size, while the underlying kernel still > operates on a 4KB granularity [2]. > > In our emulation experience, we've run into a few specific rough edges: > > 1. mmap and SIGBUS: Enforcing a larger VMA granularity means that > mapping files can easily extend the VMA beyond the end of the file's > valid offset. When userspace touches this padded area, the 4KB filemap > fault cannot resolve to a valid index, resulting in a SIGBUS that > applications aren't expecting. You did mention in the other email the links below, and I went ahead to compare :) I was puzzled to see some sort of VMA padding approach in your patches. 
OTOH our approach pads anonymous pages. So for example,
if a 64K process maps a 12K sized file, we will map 52K/4K = 13 anonymous
pages into the 64K-aligned VMA.

Implementation-wise, we detect such a condition in filemap_fault
and return VM_FAULT_NEED_ANONPAGE, and redirect that to do_anonymous_page
to map 4K pages.

>
> 2. userfaultfd: This inherently operates at the strict PTE granularity
> of the underlying kernel (4KB). Hiding this from a userspace that
> expects a 16KB/64KB fault granularity while the kernel still operates
> on 4KB granularity is messy ...

Indeed. We will have to fault in 16 4K pages.

>
> 3. pagemap and PFN interfaces: As you noted with procfs, interfaces
> that expose or consume PFNs are problematic. Userspace tools reading
> /proc/pid/pagemap, /proc/kpagecount, /proc/kpageflags,
> /proc/kpagecgroup, and /sys/kernel/mm/page_idle/bitmap calculate
> offsets based on the userspace page size ABI, but the kernel returns
> 4KB PFNs which breaks such users.
>
>
> It would be great to explore if we can align on a unified approach to
> solve these.
>
> [1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator
> [2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic
>
> Thanks,
> Kalesh
>
>> -------------
>> Key Attendees
>> -------------
>> - Ryan Roberts (co-presenter)
>> - mm folks (David Hildenbrand, Matthew Wilcox, Liam Howlett, Lorenzo Stoakes,
>> and many others)
>> - arch folks
>>
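The padding arithmetic in Dev's 12K-file example can be sketched as follows (the helper is an illustrative assumption, not the actual filemap_fault logic): round the mapping length up to the process page size, back the file-covered part with kernel pages, and fill the remainder with 4K anonymous pages.

```python
# Illustrative sketch (assumed helper, not the actual kernel logic):
# round the mapped range up to the process page size and back-fill the
# tail past EOF with kernel-sized (4K) anonymous pages.

KERNEL_PAGE = 4 * 1024

def anon_pad_pages(file_size: int, proc_page: int) -> int:
    """4K anonymous pages needed to pad the tail of the mapped range."""
    vma_len = -(-file_size // proc_page) * proc_page  # round up to proc_page
    file_pages = -(-file_size // KERNEL_PAGE)         # 4K pages backing the file
    return (vma_len // KERNEL_PAGE) - file_pages

# Dev's example: a 64K process mapping a 12K file gets a 64K VMA with
# 3 file-backed 4K pages plus 52K/4K = 13 anonymous pages.
assert anon_pad_pages(12 * 1024, 64 * 1024) == 13

# Android's 16K emulation case for the same file needs only 1 pad page.
assert anon_pad_pages(12 * 1024, 16 * 1024) == 1
```

When the file size is already a multiple of the process page size, the helper returns 0 and no anonymous padding is needed.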
* Re: [LSF/MM/BPF TOPIC] Per-process page size 2026-02-26 8:45 ` Dev Jain @ 2026-02-27 5:11 ` Kalesh Singh 0 siblings, 0 replies; 21+ messages in thread From: Kalesh Singh @ 2026-02-27 5:11 UTC (permalink / raw) To: Dev Jain Cc: lsf-pc, ryan.roberts, catalin.marinas, will, ardb, willy, hughd, baolin.wang, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm, linux-arm-kernel, linux-kernel, Mateusz Maćkowski, Adrian Barnaś, Marcin Szymczyk On Thu, Feb 26, 2026 at 12:45 AM Dev Jain <dev.jain@arm.com> wrote: > > > > On 26/02/26 1:10 pm, Kalesh Singh wrote: > > On Tue, Feb 17, 2026 at 6:50 AM Dev Jain <dev.jain@arm.com> wrote: > >> > >> Hi everyone, > >> > >> We propose per-process page size on arm64. Although the proposal is for > >> arm64, perhaps the concept can be extended to other arches, thus the > >> generic topic name. > >> > >> ------------- > >> INTRODUCTION > >> ------------- > >> While mTHP has brought the performance of many workloads running on an arm64 4K > >> kernel closer to that of the performance on an arm64 64K kernel, a performance > >> gap still remains. This is attributed to a combination of greater number of > >> pgtable levels, less reach within the walk cache and higher data cache footprint > >> for pgtable memory. At the same time, 64K is not suitable for general > >> purpose environments due to it's significantly higher memory footprint. > >> > >> To solve this, we have been experimenting with a concept called "per-process > >> page size". This breaks the historic assumption of a single page size for the > >> entire system: a process will now operate on a page size ABI that is greater > >> than or equal to the kernel's page size. This is enabled by a key architectural > >> feature on Arm: the separation of user and kernel page tables. > >> > >> This can also lead to a future of a single kernel image instead of 4K, 16K > >> and 64K images. 
> >> > >> -------------- > >> CURRENT DESIGN > >> -------------- > >> The design is based on one core idea; most of the kernel continues to believe > >> there is only one page size in use across the whole system. That page size is > >> the size selected at compile-time, as is done today. But every process (more > >> accurately mm_struct) has a page size ABI which is one of the 3 page sizes > >> (4K, 16K or 64K) as long as that page size is greater than or equal to the > >> kernel page size (kernel page size is the macro PAGE_SIZE). > >> > >> Pagesize selection > >> ------------------ > >> A process' selected page size ABI comes into force at execve() time and > >> remains fixed until the process exits or until the next execve(). Any forked > >> processes inherit the page size of their parent. > >> The personality() mechanism already exists for similar cases, so we propose > >> to extend it to enable specifying the required page size. > >> > >> There are 3 layers to the design. The first two are not arch-dependent, > >> and makes Linux support a per-process pagesize ABI. The last layer is > >> arch-specific. > >> > >> 1. ABI adapter > >> -------------- > >> A translation layer is added at the syscall boundary to convert between the > >> process page size and the kernel page size. This effectively means enforcing > >> alignment requirements for addresses passed to syscalls and ensuring that > >> quantities passed as “number of pages” are interpreted relative to the process > >> page size and not the kernel page size. In this way the process has the illusion > >> that it is working in units of its page size, but the kernel is working in > >> units of the kernel page size. > >> > >> 2. Generic Linux MM enlightenment > >> --------------------------------- > >> We enlighten the Linux MM code to always hand out memory in the granularity > >> of process pages. 
Most of this work is greatly simplified because of the > >> existing mTHP allocation paths, and the ongoing support for large folios > >> across different areas of the kernel. The process order will be used as the > >> hard minimum mTHP order to allocate. > >> > >> File memory > >> ----------- > >> For a growing list of compliant file systems, large folios can already be > >> stored in the page cache. There is even a mechanism, introduced to support > >> filesystems with block sizes larger than the system page size, to set a > >> hard-minimum size for folios on a per-address-space basis. This mechanism > >> will be reused and extended to service the per-process page size requirements. > >> > >> One key reason that the 64K kernel currently consumes considerably more memory > >> than the 4K kernel is that Linux systems often have lots of small > >> configuration files which each require a page in the page cache. But these > >> small files are (likely) only used by certain processes. So, we prefer to > >> continue to cache those using a 4K page. > >> Therefore, if a process with a larger page size maps a file whose pagecache > >> contains smaller folios, we drop them and re-read the range with a folio > >> order at least that of the process order. > >> > >> 3. Translation from Linux pagetable to native pagetable > >> ------------------------------------------------------- > >> Assume the case of a kernel pagesize of 4K and app pagesize of 64K. > >> Now that enlightenment is done, it is guaranteed that every single mapping > >> in the 4K pagetable (which we call the Linux pagetable) is of granularity > >> at least 64K. In the arm64 MM code, we maintain a "native" pagetable per > >> mm_struct, which is based off a 64K geometry. 
Because of the guarantee > >> aforementioned, any pagetable operation on the Linux pagetable > >> (set_ptes, clear_flush_ptes, modify_prot_start_ptes, etc) is going to happen > >> at a granularity of at least 16 PTEs - therefore we can translate this > >> operation to modify a single PTE entry in the native pagetable. > >> Given that enlightenment may miss corner cases, we insert a warning in the > >> architecture code - on being presented with an operation not translatable > >> into a native operation, we fallback to the Linux pagetable, thus losing > >> the benefits borne out of the pagetable geometry but keeping > >> the emulation intact. > >> > >> ----------------------- > >> What we want to discuss > >> ----------------------- > >> - Are there other arches which could benefit from this? > >> - What level of compatibility we can achieve - is it even possible to > >> contain userspace within the emulated ABI? > >> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For > >> example, what happens when a 64K process opens a procfs file of > >> a 4K process? > >> - native pgtable implementation - perhaps inspiration can be taken > >> from other arches with an involved pgtable logic (ppc, s390)? > >> > > > > Hi Dev, Ryan, > > > > I'd be very interested in joining this discussion at LSF/MM. > > Thanks Kalesh for your interest! > > > > > On Android, we have a separate but very related use case: we emulate a > > larger userspace page size on x86, primarily to allow app developers > > to test their apps for 16KB compatibility using x86 emulators [1]. > > > > Similar to your proposed "ABI adapter" layer, our approach works by > > enforcing a larger 16KB granularity and alignment on the VMAs to > > emulate the userspace page size, while the underlying kernel still > > operates on a 4KB granularity [2]. > > > > In our emulation experience, we've run into a few specific rough edges: > > > > 1. 
mmap and SIGBUS: Enforcing a larger VMA granularity means that
> > mapping files can easily extend the VMA beyond the end of the file's
> > valid offset. When userspace touches this padded area, the 4KB filemap
> > fault cannot resolve to a valid index, resulting in a SIGBUS that
> > applications aren't expecting.
>
> You did mention in the other email the links below, and I went ahead
> to compare :) I was puzzled to see some sort of VMA padding approach
> in your patches. OTOH our approach pads anonymous pages. So for example,
> if a 64K process maps a 12K sized file, we will map 52K/4K = 13 anonymous
> pages into the 64K-aligned VMA.
>
> Implementation-wise, we detect such a condition in filemap_fault
> and return VM_FAULT_NEED_ANONPAGE, and redirect that to do_anonymous_page
> to map 4K pages.

Ah, the VMA padding patches you saw are actually for a different feature.
To handle the file mapping overhang, we currently insert a separate
anonymous VMA to cover the remainder of the emulated page range. Though
I think your approach of returning VM_FAULT_NEED_ANONPAGE to fault
anonymous pages without needing to manage extra VMAs is a much cleaner
design :)

Thanks,
Kalesh

>
>
> > 2. userfaultfd: This inherently operates at the strict PTE granularity
> > of the underlying kernel (4KB). Hiding this from a userspace that
> > expects a 16KB/64KB fault granularity while the kernel still operates
> > on 4KB granularity is messy ...
>
> Indeed. We will have to fault in 16 4K pages.
>
> >
> > 3. pagemap and PFN interfaces: As you noted with procfs, interfaces
> > that expose or consume PFNs are problematic. Userspace tools reading
> > /proc/pid/pagemap, /proc/kpagecount, /proc/kpageflags,
> > /proc/kpagecgroup, and /sys/kernel/mm/page_idle/bitmap calculate
> > offsets based on the userspace page size ABI, but the kernel returns
> > 4KB PFNs which breaks such users.
> >
> >
> > It would be great to explore if we can align on a unified approach to
> > solve these.
> >
> > [1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator
> > [2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic
> >
> > Thanks,
> > Kalesh
> >
> >> -------------
> >> Key Attendees
> >> -------------
> >> - Ryan Roberts (co-presenter)
> >> - mm folks (David Hildenbrand, Matthew Wilcox, Liam Howlett, Lorenzo Stoakes,
> >> and many others)
> >> - arch folks
> >>
end of thread, other threads:[~2026-02-27  5:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-17 14:50 [LSF/MM/BPF TOPIC] Per-process page size Dev Jain
2026-02-17 15:22 ` Matthew Wilcox
2026-02-17 15:30 ` David Hildenbrand (Arm)
2026-02-17 15:51 ` Ryan Roberts
2026-02-20  4:49 ` Matthew Wilcox
2026-02-20 16:50 ` David Hildenbrand (Arm)
2026-02-23 13:02 ` [Lsf-pc] " Jan Kara
2026-02-18  8:39 ` Dev Jain
2026-02-18  8:58 ` Dev Jain
2026-02-18  9:15 ` David Hildenbrand (Arm)
2026-02-20  9:49 ` Arnd Bergmann
2026-02-20 13:37 ` Pedro Falcato
2026-02-23  5:07 ` Dev Jain
2026-02-23 12:49 ` Pedro Falcato
2026-02-23 13:01 ` David Hildenbrand (Arm)
2026-02-23 15:18 ` Matthew Wilcox
2026-02-23 16:28 ` David Hildenbrand (Arm)
2026-02-24  4:32 ` Dev Jain
2026-02-26  7:40 ` Kalesh Singh
2026-02-26  8:45 ` Dev Jain
2026-02-27  5:11 ` Kalesh Singh