* [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
@ 2026-02-19 15:08 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
` (6 more replies)
0 siblings, 7 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:08 UTC (permalink / raw)
To: lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, David Hildenbrand,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
No, there's no new hardware (that I know of). I want to explore what page size
means.
The kernel uses the same value - PAGE_SIZE - for two things:
- the order-0 buddy allocation size;
- the granularity of virtual address space mapping;
I think we can benefit from separating these two meanings and allowing
order-0 allocations to be larger than the virtual address space covered by a
PTE entry.
The main motivation is scalability. Managing memory on multi-terabyte
machines in 4k is suboptimal, to say the least.
Potential benefits of the approach (assuming 64k pages):
- The order-0 page size cuts struct page overhead by a factor of 16. From
~1.6% of RAM to ~0.1%;
- TLB wins on machines with TLB coalescing as long as mapping is naturally
aligned;
- Order-5 allocation is 2M, resulting in less pressure on the zone lock;
- 1G pages are within possibility for the buddy allocator - order-14
allocation. It can open the road to 1G THPs.
- As with THP, fewer pages - less pressure on the LRU lock;
- ...
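For reference, the arithmetic behind these numbers (assuming 64 bytes of
struct page per order-0 page):

  64 / 4096  = ~1.6% of RAM;  64 / 65536 = ~0.1%
  64k << 5   = 2M  (order-5)
  64k << 14  = 1G  (order-14)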
The trade-off is memory waste (similar to what we have on architectures with
native 64k pages today) and complexity, mostly in the core-MM code.
== Design considerations ==
I want to split PAGE_SIZE into two distinct values:
- PTE_SIZE defines the virtual address space granularity;
- PG_SIZE defines the size of the order-0 buddy allocation;
PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
requires conversion, and keep existing code working while conversion is in
progress.
The same split happens for other page-related macros: mask, shift,
alignment helpers, etc.
PFNs are in PTE_SIZE units.
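Roughly, the split could look like this. A sketch for the 64k
configuration; apart from PTE_SIZE/PG_SIZE, the names and values below are
illustrative, not what is in the tree:

	#define PTE_SHIFT	12	/* 4k: virtual address space granularity */
	#define PG_SHIFT	16	/* 64k: order-0 buddy allocation */

	#define PTE_SIZE	(_AC(1, UL) << PTE_SHIFT)
	#define PG_SIZE		(_AC(1, UL) << PG_SHIFT)
	#define PTE_MASK	(~(PTE_SIZE - 1))
	#define PG_MASK		(~(PG_SIZE - 1))

	/* Unconverted code keeps compiling only while the sizes match: */
	#if PTE_SHIFT == PG_SHIFT
	#define PAGE_SHIFT	PTE_SHIFT
	#define PAGE_SIZE	PTE_SIZE
	#endif

	/* PFNs stay in PTE_SIZE units: pfn == paddr >> PTE_SHIFT */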
The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
units.
Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
for userspace. But we might want to communicate PG_SIZE to userspace to
get the optimal results for userspace that cares.
PTE_SIZE granularity requires a substantial rework of page fault and VMA
handling:
- A struct page pointer and pgprot_t are not enough to create a PTE entry.
We also need the offset within the page we are creating the PTE for.
- Since the VMA start can be aligned arbitrarily with respect to the
underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
which is in PTE_SIZE units.
- The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
misaligned cases;
Page faults into file mappings are relatively simple to handle as we
always have the page cache to refer to. So you can map only the part of the
page that fits in the page table, similarly to fault-around.
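To make that concrete, here is a rough sketch of mapping a single PTE_SIZE
slice of a larger page on fault, assuming the page is naturally aligned in
the file as page cache pages are. vm_pteoff is the field proposed above;
map_page_slice and PTES_PER_PG (PG_SIZE / PTE_SIZE) are hypothetical names
for illustration:

	static vm_fault_t map_page_slice(struct vm_fault *vmf, struct page *page)
	{
		struct vm_area_struct *vma = vmf->vma;
		/* PTE-granular offset of the faulting address in the file... */
		unsigned long pteoff = vma->vm_pteoff +
			((vmf->address - vma->vm_start) >> PTE_SHIFT);
		/* ...selects which PTE_SIZE slice of the page to map: */
		pte_t entry = pfn_pte(page_to_pfn(page) + pteoff % PTES_PER_PG,
				      vma->vm_page_prot);

		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
		return 0;
	}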
Anonymous and file-CoW faults should also be simple as long as the VMA is
aligned to PG_SIZE in both the virtual address space and with respect to
vm_pgoff. We might waste some memory on the ends of the VMA, but it is
tolerable.
Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
pages across a page table boundary. In the worst case, a page is mapped across
a PGD entry boundary and PTEs for the page have to be put in two separate
subtrees of page tables.
A naive implementation would map different pages on different sides of a
page table boundary and accept the waste of one page per page table crossing.
The hope is that misaligned mappings are rare, but this is suboptimal.
mremap(2) is the ultimate stress test for the design.
On x86, page tables are allocated from the buddy allocator and if PG_SIZE
is greater than 4 KB, we need a way to pack multiple page tables into a
single page. We could use the slab allocator for this, but it would
require relocating the page-table metadata out of struct page.
Things I have not thought much about yet:
- Accounting for wasted memory;
- rmap;
- mapcount;
- A lot of arch-specific code;
- <insert my blind spot here>;
== Status ==
I have a POC implementation on top of v6.17:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git pte_size
It is WIP and full of hacks I am trying to make sense of now.
It compiles with my minimalistic kernel config and can boot to a shell with
both 16k and 64k base page sizes. The shell doesn't crash immediately, but
sometimes I wonder why :P
The patchset is large:
378 files changed, 3348 insertions(+), 3102 deletions(-)
and it is far from being complete.
== Goals ==
I want to get feedback for the overall design and possible ways to
upstream.
My plan is to submit an RFC-quality patchset before the summit.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
@ 2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:20 ` Peter Zijlstra
2026-02-19 15:33 ` Pedro Falcato
` (5 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:17 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
>
> I think we can benefit from separating these two meanings and allowing
> order-0 allocations to be larger than the virtual address space covered by a
> PTE entry.
Didn't AA do this a decade ago or somesuch?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:17 ` Peter Zijlstra
@ 2026-02-19 15:20 ` Peter Zijlstra
2026-02-19 15:27 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:20 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:17:29PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
>
> Didn't AA do this a decade ago or somesuch?
https://lwn.net/Articles/240914/
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:20 ` Peter Zijlstra
@ 2026-02-19 15:27 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:20:45PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 04:17:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > > No, there's no new hardware (that I know of). I want to explore what page size
> > > means.
> > >
> > > The kernel uses the same value - PAGE_SIZE - for two things:
> > >
> > > - the order-0 buddy allocation size;
> > >
> > > - the granularity of virtual address space mapping;
> > >
> > > I think we can benefit from separating these two meanings and allowing
> > > order-0 allocations to be larger than the virtual address space covered by a
> > > PTE entry.
> >
> > Didn't AA do this a decade ago or somesuch?
>
> https://lwn.net/Articles/240914/
Oh, 2007. It predates my time in the kernel. Will read up. Thanks!
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
@ 2026-02-19 15:33 ` Pedro Falcato
2026-02-19 15:50 ` Kiryl Shutsemau
2026-02-19 15:39 ` David Hildenbrand (Arm)
` (4 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Pedro Falcato @ 2026-02-19 15:33 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
>
> I think we can benefit from separating these two meanings and allowing
> order-0 allocations to be larger than the virtual address space covered by a
> PTE entry.
>
Doesn't this idea make less sense these days, with mTHP? You can get 64k
allocations simply by toggling one of the per-size entries under
/sys/kernel/mm/transparent_hugepage.
> The main motivation is scalability. Managing memory on multi-terabyte
> machines in 4k is suboptimal, to say the least.
>
> Potential benefits of the approach (assuming 64k pages):
>
> - The order-0 page size cuts struct page overhead by a factor of 16. From
> ~1.6% of RAM to ~0.1%;
>
> - TLB wins on machines with TLB coalescing as long as mapping is naturally
> aligned;
>
> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>
> - 1G pages are within possibility for the buddy allocator - order-14
> allocation. It can open the road to 1G THPs.
>
> - As with THP, fewer pages - less pressure on the LRU lock;
We could perhaps add a way to enforce a min_order globally on the page cache
to address that.
There are some points there which aren't addressed by mTHP work in any way
(1G THPs for one), others which are being addressed separately (memdesc work
trying to cut down on struct page overhead).
(I also don't understand your point about order-5 allocation, AFAIK pcp will
cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
full picture)
--
Pedro
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:33 ` Pedro Falcato
@ 2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
` (2 more replies)
2026-02-19 17:08 ` Dave Hansen
` (3 subsequent siblings)
6 siblings, 3 replies; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 15:39 UTC (permalink / raw)
To: Kiryl Shutsemau, lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Matthew Wilcox, Johannes Weiner, Usama Arif
On 2/19/26 16:08, Kiryl Shutsemau wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
>
> I think we can benefit from separating these two meanings and allowing
> order-0 allocations to be larger than the virtual address space covered by a
> PTE entry.
>
> The main motivation is scalability. Managing memory on multi-terabyte
> machines in 4k is suboptimal, to say the least.
>
> Potential benefits of the approach (assuming 64k pages):
>
> - The order-0 page size cuts struct page overhead by a factor of 16. From
> ~1.6% of RAM to ~0.1%;
>
> - TLB wins on machines with TLB coalescing as long as mapping is naturally
> aligned;
>
> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>
> - 1G pages are within possibility for the buddy allocator - order-14
> allocation. It can open the road to 1G THPs.
>
> - As with THP, fewer pages - less pressure on the LRU lock;
>
> - ...
>
> The trade-off is memory waste (similar to what we have on architectures with
> native 64k pages today) and complexity, mostly in the core-MM code.
>
> == Design considerations ==
>
> I want to split PAGE_SIZE into two distinct values:
>
> - PTE_SIZE defines the virtual address space granularity;
>
> - PG_SIZE defines the size of the order-0 buddy allocation;
>
> PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
> requires conversion, and keep existing code working while conversion is in
> progress.
>
> The same split happens for other page-related macros: mask, shift,
> alignment helpers, etc.
>
> PFNs are in PTE_SIZE units.
>
> The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
> units.
>
> Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
> for userspace. But we might want to communicate PG_SIZE to userspace to
> get the optimal results for userspace that cares.
>
> PTE_SIZE granularity requires a substantial rework of page fault and VMA
> handling:
>
> - A struct page pointer and pgprot_t are not enough to create a PTE entry.
> We also need the offset within the page we are creating the PTE for.
>
> - Since the VMA start can be aligned arbitrarily with respect to the
> underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
> which is in PTE_SIZE units.
>
> - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
> misaligned cases;
>
> Page faults into file mappings are relatively simple to handle as we
> always have the page cache to refer to. So you can map only the part of the
> page that fits in the page table, similarly to fault-around.
>
> Anonymous and file-CoW faults should also be simple as long as the VMA is
> aligned to PG_SIZE in both the virtual address space and with respect to
> vm_pgoff. We might waste some memory on the ends of the VMA, but it is
> tolerable.
>
> Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
> pages across a page table boundary. In the worst case, a page is mapped across
> a PGD entry boundary and PTEs for the page have to be put in two separate
> subtrees of page tables.
>
> A naive implementation would map different pages on different sides of a
> page table boundary and accept the waste of one page per page table crossing.
> The hope is that misaligned mappings are rare, but this is suboptimal.
>
> mremap(2) is the ultimate stress test for the design.
>
> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> is greater than 4 KB, we need a way to pack multiple page tables into a
> single page. We could use the slab allocator for this, but it would
> require relocating the page-table metadata out of struct page.
When discussing per-process page sizes with Ryan and Dev, I mentioned
that having a larger emulated page size could be interesting for other
architectures as well.
That is, we would emulate a 64K page size on Intel for user space as
well, but let the OS work with 4K pages.
We'd only allocate+map large folios into user space + pagecache, but
still allow for page tables etc. to not waste memory.
So "most" of your allocations in the system would actually be at least
64k, reducing zone lock contention etc.
It doesn't solve all the problems you wanted to tackle on your list
(e.g., "struct page" overhead, which will be sorted out by memdescs).
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:33 ` Pedro Falcato
@ 2026-02-19 15:50 ` Kiryl Shutsemau
2026-02-19 15:53 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:50 UTC (permalink / raw)
To: Pedro Falcato
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
> >
>
> Doesn't this idea make less sense these days, with mTHP? Simply by toggling one
> of the entries in /sys/kernel/mm/transparent_hugepage.
mTHP is still best effort. With this approach you don't need to care about
fragmentation: you will get your 64k page as long as you have free
memory.
> > The main motivation is scalability. Managing memory on multi-terabyte
> > machines in 4k is suboptimal, to say the least.
> >
> > Potential benefits of the approach (assuming 64k pages):
> >
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> >
> > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > aligned;
> >
> > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> >
> > - 1G pages are within possibility for the buddy allocator - order-14
> > allocation. It can open the road to 1G THPs.
> >
> > - As with THP, fewer pages - less pressure on the LRU lock;
>
> We could perhaps add a way to enforce a min_order globally on the page cache,
> as a way to address it.
Raising min_order is not free. It puts more pressure on the page allocator.
> There are some points there which aren't addressed by mTHP work in any way
> (1G THPs for one), others which are being addressed separately (memdesc work
> trying to cut down on struct page overhead).
>
> (I also don't understand your point about order-5 allocation, AFAIK pcp will
> cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
> full picture)
With a higher base page size, the page allocator doesn't need to do as much
work to merge/split buddy pages. So serving the same 2M as an order-5
allocation is cheaper than as order-9.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:50 ` Kiryl Shutsemau
@ 2026-02-19 15:53 ` David Hildenbrand (Arm)
2026-02-19 19:31 ` Pedro Falcato
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 15:53 UTC (permalink / raw)
To: Kiryl Shutsemau, Pedro Falcato
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 16:50, Kiryl Shutsemau wrote:
> On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
>> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
>>> No, there's no new hardware (that I know of). I want to explore what page size
>>> means.
>>>
>>> The kernel uses the same value - PAGE_SIZE - for two things:
>>>
>>> - the order-0 buddy allocation size;
>>>
>>> - the granularity of virtual address space mapping;
>>>
>>> I think we can benefit from separating these two meanings and allowing
>>> order-0 allocations to be larger than the virtual address space covered by a
>>> PTE entry.
>>>
>>
>> Doesn't this idea make less sense these days, with mTHP? Simply by toggling one
>> of the entries in /sys/kernel/mm/transparent_hugepage.
>
> mTHP is still best effort. This is way you don't need to care about
> fragmentation, you will get your 64k page as long as you have free
> memory.
>
>>> The main motivation is scalability. Managing memory on multi-terabyte
>>> machines in 4k is suboptimal, to say the least.
>>>
>>> Potential benefits of the approach (assuming 64k pages):
>>>
>>> - The order-0 page size cuts struct page overhead by a factor of 16. From
>>> ~1.6% of RAM to ~0.1%;
>>>
>>> - TLB wins on machines with TLB coalescing as long as mapping is naturally
>>> aligned;
>>>
>>> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>>>
>>> - 1G pages are within possibility for the buddy allocator - order-14
>>> allocation. It can open the road to 1G THPs.
>>>
>>> - As with THP, fewer pages - less pressure on the LRU lock;
>>
>> We could perhaps add a way to enforce a min_order globally on the page cache,
>> as a way to address it.
>
> Raising min_order is not free. I puts more pressure on page allocator.
>
>> There are some points there which aren't addressed by mTHP work in any way
>> (1G THPs for one), others which are being addressed separately (memdesc work
>> trying to cut down on struct page overhead).
>>
>> (I also don't understand your point about order-5 allocation, AFAIK pcp will
>> cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
>> full picture)
>
> With higher base page size, page allocator doesn't need to do as much
> work to merge/split buddy pages. So serving the same 2M as order-5 is
> cheaper than order-9.
I think the idea is that if most of your allocations (anon + pagecache)
are 64k instead of 4k, on average, you'll just naturally do less
merging/splitting.
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:39 ` David Hildenbrand (Arm)
@ 2026-02-19 15:54 ` Kiryl Shutsemau
2026-02-19 16:09 ` David Hildenbrand (Arm)
2026-02-19 17:09 ` Kiryl Shutsemau
2026-02-19 23:24 ` Kalesh Singh
2 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:54 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
> >
> > The main motivation is scalability. Managing memory on multi-terabyte
> > machines in 4k is suboptimal, to say the least.
> >
> > Potential benefits of the approach (assuming 64k pages):
> >
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> >
> > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > aligned;
> >
> > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> >
> > - 1G pages are within possibility for the buddy allocator - order-14
> > allocation. It can open the road to 1G THPs.
> >
> > - As with THP, fewer pages - less pressure on the LRU lock;
> >
> > - ...
> >
> > The trade-off is memory waste (similar to what we have on architectures with
> > native 64k pages today) and complexity, mostly in the core-MM code.
> >
> > == Design considerations ==
> >
> > I want to split PAGE_SIZE into two distinct values:
> >
> > - PTE_SIZE defines the virtual address space granularity;
> >
> > - PG_SIZE defines the size of the order-0 buddy allocation;
> >
> > PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
> > requires conversion, and keep existing code working while conversion is in
> > progress.
> >
> > The same split happens for other page-related macros: mask, shift,
> > alignment helpers, etc.
> >
> > PFNs are in PTE_SIZE units.
> >
> > The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
> > units.
> >
> > Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
> > for userspace. But we might want to communicate PG_SIZE to userspace to
> > get the optimal results for userspace that cares.
> >
> > PTE_SIZE granularity requires a substantial rework of page fault and VMA
> > handling:
> >
> > - A struct page pointer and pgprot_t are not enough to create a PTE entry.
> > We also need the offset within the page we are creating the PTE for.
> >
> > - Since the VMA start can be aligned arbitrarily with respect to the
> > underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
> > which is in PTE_SIZE units.
> >
> > - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
> > misaligned cases;
> >
> > Page faults into file mappings are relatively simple to handle as we
> > always have the page cache to refer to. So you can map only the part of the
> > page that fits in the page table, similarly to fault-around.
> >
> > Anonymous and file-CoW faults should also be simple as long as the VMA is
> > aligned to PG_SIZE in both the virtual address space and with respect to
> > vm_pgoff. We might waste some memory on the ends of the VMA, but it is
> > tolerable.
> >
> > Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
> > pages across a page table boundary. In the worst case, a page is mapped across
> > a PGD entry boundary and PTEs for the page have to be put in two separate
> > subtrees of page tables.
> >
> > A naive implementation would map different pages on different sides of a
> > page table boundary and accept the waste of one page per page table crossing.
> > The hope is that misaligned mappings are rare, but this is suboptimal.
> >
> > mremap(2) is the ultimate stress test for the design.
> >
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> When discussing per-process page sizes with Ryan and Dev, I mentioned that
> having a larger emulated page size could be interesting for other
> architectures as well.
>
> That is, we would emulate a 64K page size on Intel for user space as well,
> but let the OS work with 4K pages.
>
> We'd only allocate+map large folios into user space + pagecache, but still
> allow for page tables etc. to not waste memory.
>
> So "most" of your allocations in the system would actually be at least 64k,
> reducing zone lock contention etc.
I am not convinced emulation would help zone lock contention. I expect
contention to be higher if the page allocator sees a mix of 4k and 64k
requests. It sounds like constant split/merge under the lock.
> It doesn't solve all the problems you wanted to tackle on your list (e.g.,
> "struct page" overhead, which will be sorted out by memdescs).
I don't think we can serve 1G pages out of the buddy allocator with 4k
order-0. And without that, I don't see how to get to viable 1G THPs.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:54 ` Kiryl Shutsemau
@ 2026-02-19 16:09 ` David Hildenbrand (Arm)
2026-02-20 2:55 ` Zi Yan
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 16:09 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 16:54, Kiryl Shutsemau wrote:
> On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/19/26 16:08, Kiryl Shutsemau wrote:
>>> No, there's no new hardware (that I know of). I want to explore what page size
>>> means.
>>>
>>> The kernel uses the same value - PAGE_SIZE - for two things:
>>>
>>> - the order-0 buddy allocation size;
>>>
>>> - the granularity of virtual address space mapping;
>>>
>>> I think we can benefit from separating these two meanings and allowing
>>> order-0 allocations to be larger than the virtual address space covered by a
>>> PTE entry.
>>>
>>> The main motivation is scalability. Managing memory on multi-terabyte
>>> machines in 4k is suboptimal, to say the least.
>>>
>>> Potential benefits of the approach (assuming 64k pages):
>>>
>>> - The order-0 page size cuts struct page overhead by a factor of 16. From
>>> ~1.6% of RAM to ~0.1%;
>>>
>>> - TLB wins on machines with TLB coalescing as long as mapping is naturally
>>> aligned;
>>>
>>> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>>>
>>> - 1G pages are within possibility for the buddy allocator - order-14
>>> allocation. It can open the road to 1G THPs.
>>>
>>> - As with THP, fewer pages - less pressure on the LRU lock;
>>>
>>> - ...
>>>
>>> The trade-off is memory waste (similar to what we have on architectures with
>>> native 64k pages today) and complexity, mostly in the core-MM code.
>>>
>>> == Design considerations ==
>>>
>>> I want to split PAGE_SIZE into two distinct values:
>>>
>>> - PTE_SIZE defines the virtual address space granularity;
>>>
>>> - PG_SIZE defines the size of the order-0 buddy allocation;
>>>
>>> PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
>>> requires conversion, and keep existing code working while conversion is in
>>> progress.
>>>
>>> The same split happens for other page-related macros: mask, shift,
>>> alignment helpers, etc.
>>>
>>> PFNs are in PTE_SIZE units.
>>>
>>> The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
>>> units.
>>>
>>> Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
>>> for userspace. But we might want to communicate PG_SIZE to userspace to
>>> get the optimal results for userspace that cares.
>>>
>>> PTE_SIZE granularity requires a substantial rework of page fault and VMA
>>> handling:
>>>
>>> - A struct page pointer and pgprot_t are not enough to create a PTE entry.
>>> We also need the offset within the page we are creating the PTE for.
>>>
>>> - Since the VMA start can be aligned arbitrarily with respect to the
>>> underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
>>> which is in PTE_SIZE units.
>>>
>>> - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
>>> misaligned cases;
>>>
>>> Page faults into file mappings are relatively simple to handle as we
>>> always have the page cache to refer to. So you can map only the part of the
>>> page that fits in the page table, similarly to fault-around.
>>>
>>> Anonymous and file-CoW faults should also be simple as long as the VMA is
>>> aligned to PG_SIZE in both the virtual address space and with respect to
>>> vm_pgoff. We might waste some memory on the ends of the VMA, but it is
>>> tolerable.
>>>
>>> Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
>>> pages across a page table boundary. In the worst case, a page is mapped across
>>> a PGD entry boundary and PTEs for the page have to be put in two separate
>>> subtrees of page tables.
>>>
>>> A naive implementation would map different pages on different sides of a
>>> page table boundary and accept the waste of one page per page table crossing.
>>> The hope is that misaligned mappings are rare, but this is suboptimal.
>>>
>>> mremap(2) is the ultimate stress test for the design.
>>>
>>> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
>>> is greater than 4 KB, we need a way to pack multiple page tables into a
>>> single page. We could use the slab allocator for this, but it would
>>> require relocating the page-table metadata out of struct page.
>>
>> When discussing per-process page sizes with Ryan and Dev, I mentioned that
>> having a larger emulated page size could be interesting for other
>> architectures as well.
>>
>> That is, we would emulate a 64K page size on Intel for user space as well,
>> but let the OS work with 4K pages.
>>
>> We'd only allocate+map large folios into user space + pagecache, but still
>> allow for page tables etc. to not waste memory.
>>
>> So "most" of your allocations in the system would actually be at least 64k,
>> reducing zone lock contention etc.
>
> I am not convinced emulation would help zone lock contention. I expect
> contention to be higher if page allocator would see a mix of 4k and 64k
> requests. It sounds like constant split/merge under the lock.
If most of your allocations are larger, then there isn't that much
splitting/merging.
There will be some for the < 64k allocations of course, but when all
user space + page cache allocations are >= 64k, the split/merge work and
zone lock pressure should be heavily reduced.
>
>> It doesn't solve all the problems you wanted to tackle on your list (e.g.,
>> "struct page" overhead, which will be sorted out by memdescs).
>
> I don't think we can serve 1G pages out of buddy allocator with 4k
> order-0. And without it, I don't see how to get to a viable 1G THPs.
Zi Yan was the one working on this, and I think we had ideas on how to make
that work in the long run.
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (2 preceding siblings ...)
2026-02-19 15:39 ` David Hildenbrand (Arm)
@ 2026-02-19 17:08 ` Dave Hansen
2026-02-19 22:05 ` Kiryl Shutsemau
2026-02-19 17:30 ` Dave Hansen
` (2 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2026-02-19 17:08 UTC (permalink / raw)
To: Kiryl Shutsemau, lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, David Hildenbrand,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 07:08, Kiryl Shutsemau wrote:
> - The order-0 page size cuts struct page overhead by a factor of 16. From
> ~1.6% of RAM to ~0.1%;
First of all, this looks like fun. Nice work! I'm not opposed at all in
concept to cleaning up things and doing the logical separation you
described to split buddy granularity and mapping granularity. That seems
like a worthy endeavor and some of the union/#define tricks look like a
likely viable way to do it incrementally.
But I don't think there's going to be a lot of memory savings in the
end. Maybe this would bring the mem= hyperscalers back into the fold and
have them actually start using 'struct page' again for their VM memory.
Dunno.
But, let's look at my kernel directory and round the file sizes up to
4k, 16k and 64k:
find . -printf '%s\n' | while read size; do echo \
$(((size + 0x0fff) & 0xfffff000)) \
$(((size + 0x3fff) & 0xffffc000)) \
$(((size + 0xffff) & 0xffff0000));
done
... and add them all up:
11,297,648 KB - on disk
11,297,712 KB - in a 4k page cache
12,223,488 KB - in a 16k page cache
16,623,296 KB - in a 64k page cache
So a 64k page cache eats ~5GB of extra memory for a kernel tree (well,
_my_ kernel tree). In other words, if you are looking for memory savings
on my laptop, you'll need ~300GB of RAM before 'struct page' overhead
overwhelms the page cache bloat from a single kernel tree.
The whole kernel obviously isn't in the page cache all at the same time.
The page cache across the system is also obviously different than a
kernel tree, but you get the point.
That's not to diminish how useful something like this might be,
especially for folks that are sensitive to 'struct page' overhead or
allocator performance.
But, it will mostly be getting better performance at the _cost_ of
consuming more RAM, not saving RAM.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
@ 2026-02-19 17:09 ` Kiryl Shutsemau
2026-02-20 10:24 ` David Hildenbrand (Arm)
2026-02-19 23:24 ` Kalesh Singh
2 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 17:09 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
> >
> > The main motivation is scalability. Managing memory on multi-terabyte
> > machines in 4k is suboptimal, to say the least.
> >
> > Potential benefits of the approach (assuming 64k pages):
> >
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> >
> > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > aligned;
> >
> > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> >
> > - 1G pages are within possibility for the buddy allocator - order-14
> > allocation. It can open the road to 1G THPs.
> >
> > - As with THP, fewer pages - less pressure on the LRU lock;
> >
> > - ...
> >
> > The trade-off is memory waste (similar to what we have on architectures with
> > native 64k pages today) and complexity, mostly in the core-MM code.
> >
> > == Design considerations ==
> >
> > I want to split PAGE_SIZE into two distinct values:
> >
> > - PTE_SIZE defines the virtual address space granularity;
> >
> > - PG_SIZE defines the size of the order-0 buddy allocation;
> >
> > PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
> > requires conversion, and keep existing code working while conversion is in
> > progress.
> >
> > The same split happens for other page-related macros: mask, shift,
> > alignment helpers, etc.
> >
> > PFNs are in PTE_SIZE units.
> >
> > The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
> > units.
> >
> > Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
> > for userspace. But we might want to communicate PG_SIZE to userspace to
> > get the optimal results for userspace that cares.
> >
> > PTE_SIZE granularity requires a substantial rework of page fault and VMA
> > handling:
> >
> > - A struct page pointer and pgprot_t are not enough to create a PTE entry.
> > We also need the offset within the page we are creating the PTE for.
> >
> > - Since the VMA start can be aligned arbitrarily with respect to the
> > underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
> > which is in PTE_SIZE units.
> >
> > - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
> > misaligned cases;
> >
> > Page faults into file mappings are relatively simple to handle as we
> > always have the page cache to refer to. So you can map only the part of the
> > page that fits in the page table, similarly to fault-around.
> >
> > Anonymous and file-CoW faults should also be simple as long as the VMA is
> > aligned to PG_SIZE in both the virtual address space and with respect to
> > vm_pgoff. We might waste some memory on the ends of the VMA, but it is
> > tolerable.
> >
> > Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
> > pages across a page table boundary. In the worst case, a page is mapped across
> > a PGD entry boundary and PTEs for the page have to be put in two separate
> > subtrees of page tables.
> >
> > A naive implementation would map different pages on different sides of a
> > page table boundary and accept the waste of one page per page table crossing.
> > The hope is that misaligned mappings are rare, but this is suboptimal.
> >
> > mremap(2) is the ultimate stress test for the design.
> >
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> When discussing per-process page sizes with Ryan and Dev, I mentioned that
> having a larger emulated page size could be interesting for other
> architectures as well.
>
> That is, we would emulate a 64K page size on Intel for user space as well,
> but let the OS work with 4K pages.
Just to clarify: do you want this to be enforced in the userspace ABI?
Like, all mappings being 64k aligned?
> We'd only allocate+map large folios into user space + pagecache, but still
> allow for page tables etc. to not waste memory.
Wasting memory on page tables is solvable and pretty straightforward.
Most such cases can be solved mechanically by switching to slab.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (3 preceding siblings ...)
2026-02-19 17:08 ` Dave Hansen
@ 2026-02-19 17:30 ` Dave Hansen
2026-02-19 22:14 ` Kiryl Shutsemau
2026-02-19 17:47 ` Matthew Wilcox
2026-02-20 9:04 ` David Laight
6 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2026-02-19 17:30 UTC (permalink / raw)
To: Kiryl Shutsemau, lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, David Hildenbrand,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 07:08, Kiryl Shutsemau wrote:
...
> The patchset is large:
>
> 378 files changed, 3348 insertions(+), 3102 deletions(-)
A few notes about the diffstats:
$ git diff v6.17..HEAD arch/x86 | diffstat | tail -1
105 files changed, 874 insertions(+), 843 deletions(-)
$ git diff v6.17..HEAD mm | diffstat | tail -1
53 files changed, 1136 insertions(+), 1069 deletions(-)
The vast, vast majority of this seems to be the renames. Stuff like:
> - new = round_down(new, PAGE_SIZE);
> + new = round_down(new, PTE_SIZE);
or even less worrying:
> -int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
> +int set_direct_map_valid_noflush(struct page *page, unsigned numpages, bool valid);
That stuff obviously needs to be audited but it's far less concerning
than the logic changes.
So just for review sanity, if you go forward with this, I'd very much
appreciate a strong separation of the purely mechanical bits from any
logic changes.
> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> is greater than 4 KB, we need a way to pack multiple page tables into a
> single page. We could use the slab allocator for this, but it would
> require relocating the page-table metadata out of struct page.
Others mentioned this, but I think this essentially gates what you are
doing behind a full tree conversion over to ptdescs.
The most useful thing we can do with this series is look at it and
decide what _other_ things need to get done before the tree could
possibly go in that direction, like ptdesc or the disambiguation
between PTE_SIZE and PG_SIZE that you've kicked off here.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (4 preceding siblings ...)
2026-02-19 17:30 ` Dave Hansen
@ 2026-02-19 17:47 ` Matthew Wilcox
2026-02-19 22:26 ` Kiryl Shutsemau
2026-02-20 9:04 ` David Laight
6 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2026-02-19 17:47 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> is greater than 4 KB, we need a way to pack multiple page tables into a
> single page. We could use the slab allocator for this, but it would
> require relocating the page-table metadata out of struct page.
Have you looked at the s390/ppc implementations (yes, they're different,
no, that sucks)? slab seems like the wrong approach to me.
There's a third approach that I've never looked at which is to allocate
the larger size, then just use it for N consecutive entries.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:53 ` David Hildenbrand (Arm)
@ 2026-02-19 19:31 ` Pedro Falcato
0 siblings, 0 replies; 33+ messages in thread
From: Pedro Falcato @ 2026-02-19 19:31 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:53:10PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:50, Kiryl Shutsemau wrote:
> > On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
> > > On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > > > No, there's no new hardware (that I know of). I want to explore what page size
> > > > means.
> > > >
> > > > The kernel uses the same value - PAGE_SIZE - for two things:
> > > >
> > > > - the order-0 buddy allocation size;
> > > >
> > > > - the granularity of virtual address space mapping;
> > > >
> > > > I think we can benefit from separating these two meanings and allowing
> > > > order-0 allocations to be larger than the virtual address space covered by a
> > > > PTE entry.
> > > >
> > >
> > > Doesn't this idea make less sense these days, with mTHP? Simply by toggling one
> > > of the entries in /sys/kernel/mm/transparent_hugepage.
> >
> > mTHP is still best effort. This is way you don't need to care about
> > fragmentation, you will get your 64k page as long as you have free
> > memory.
> >
> > > > The main motivation is scalability. Managing memory on multi-terabyte
> > > > machines in 4k is suboptimal, to say the least.
> > > >
> > > > Potential benefits of the approach (assuming 64k pages):
> > > >
> > > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > > ~1.6% of RAM to ~0.1%;
> > > >
> > > > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > > > aligned;
> > > >
> > > > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> > > >
> > > > - 1G pages are within possibility for the buddy allocator - order-14
> > > > allocation. It can open the road to 1G THPs.
> > > >
> > > > - As with THP, fewer pages - less pressure on the LRU lock;
> > >
> > > We could perhaps add a way to enforce a min_order globally on the page cache,
> > > as a way to address it.
> >
> > Raising min_order is not free. I puts more pressure on page allocator.
> >
> > > There are some points there which aren't addressed by mTHP work in any way
> > > (1G THPs for one), others which are being addressed separately (memdesc work
> > > trying to cut down on struct page overhead).
> > >
> > > (I also don't understand your point about order-5 allocation, AFAIK pcp will
> > > cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
> > > full picture)
> >
> > With higher base page size, page allocator doesn't need to do as much
> > work to merge/split buddy pages. So serving the same 2M as order-5 is
> > cheaper than order-9.
>
> I think the idea is that if most of your allocations (anon + pagecache) are
> 64k instead of 4k, on average, you'll just naturally do less merging
> splitting.
Yep. That plus slab_min_order would hopefully yield a system where 90%+
of allocations (depending on how your filesystem's buffer cache works) are 64K.
--
Pedro
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:08 ` Dave Hansen
@ 2026-02-19 22:05 ` Kiryl Shutsemau
2026-02-20 3:28 ` Liam R. Howlett
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 22:05 UTC (permalink / raw)
To: Dave Hansen
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> ...
> But, it will mostly be getting better performance at the _cost_ of
> consuming more RAM, not saving RAM.
That's fair.
The problem with struct page memory consumption is that it is static and
cannot be reclaimed. You pay the struct page tax no matter what.
Page cache rounding overhead can be large, but a motivated userspace can
keep it under control by avoiding splitting a dataset into many small
files. And this memory is reclaimable.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:30 ` Dave Hansen
@ 2026-02-19 22:14 ` Kiryl Shutsemau
2026-02-19 22:21 ` Dave Hansen
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 22:14 UTC (permalink / raw)
To: Dave Hansen
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 09:30:36AM -0800, Dave Hansen wrote:
> On 2/19/26 07:08, Kiryl Shutsemau wrote:
> ...
> > The patchset is large:
> >
> > 378 files changed, 3348 insertions(+), 3102 deletions(-)
>
> A few notes about the diffstats:
>
> $ git diff v6.17..HEAD arch/x86 | diffstat | tail -1
> 105 files changed, 874 insertions(+), 843 deletions(-)
> $ git diff v6.17..HEAD mm | diffstat | tail -1
> 53 files changed, 1136 insertions(+), 1069 deletions(-)
>
> The vast, vast majority of this seems to be the renames. Stuff like:
>
> > - new = round_down(new, PAGE_SIZE);
> > + new = round_down(new, PTE_SIZE);
>
> or even less worrying:
>
> > -int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
> > +int set_direct_map_valid_noflush(struct page *page, unsigned numpages, bool valid);
>
> That stuff obviously needs to be audited but it's far less concerning
> than the logic changes.
>
> So just for review sanity, if you go forward with this, I'd very much
> appreciate a strong separation of the purely mechanical bits from any
> logic changes.
That's the plan. That's the only way I can keep myself sane :P
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> Others mentioned this, but I think this essentially gates what you are
> doing behind a full tree conversion over to ptdescs.
I have not followed ptdescs closely. Need to catch up.
For PoC, I will just waste full order-0 page for page table. Packing is
not required for correctness.
> The most useful thing we can do with this series is look at it and
> decide what _other_ things need to get done before the tree could
> possibly go in that direction, like ptdesc or a the disambiguation
> between PTE_SIZE and PG_SIZE that you've kicked off here.
Right.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 22:14 ` Kiryl Shutsemau
@ 2026-02-19 22:21 ` Dave Hansen
0 siblings, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2026-02-19 22:21 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On 2/19/26 14:14, Kiryl Shutsemau wrote:
>> Others mentioned this, but I think this essentially gates what you are
>> doing behind a full tree conversion over to ptdescs.
> I have not followed ptdescs closely. Need to catch up.
>
> For PoC, I will just waste full order-0 page for page table. Packing is
> not required for correctness.
Yeah, I guess padding it out is ugly but effective.
I was trying to figure out how it would apply to the KPTI pgd because we
just flip bit 12 to switch between user and kernel PGDs. But I guess the
8k of PGDs in the current allocation will fit fine in 128k, so it's
weird but functional.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:47 ` Matthew Wilcox
@ 2026-02-19 22:26 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 22:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 05:47:22PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> Have you looked at the s390/ppc implementations (yes, they're different,
> no, that sucks)?
No, will check it out tomorrow.
> slab seems like the wrong approach to me.
It was the first thing that came to mind. I have not put much time into
it.
> There's a third approach that I've never looked at which is to allocate
> the larger size, then just use it for N consecutive entries.
Yeah, that's a possible way. We would need to populate 16 page table
entries of the parent page table. But you don't need to care about
fragmentation within the page.
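Something like the following, assuming 64k PG_SIZE, 4k page tables, and a
PMD range that is 16-entry aligned. The names are illustrative, not from
the branch:

	/* One PG_SIZE allocation backs PG_SIZE / PTE_SIZE == 16 consecutive
	 * 4k page tables; wire up 16 consecutive PMD entries at once. */
	static void pmd_populate_batch(struct mm_struct *mm, pmd_t *pmd,
				       struct page *tables)
	{
		unsigned long pfn = page_to_pfn(tables);	/* PTE_SIZE units */
		int i;

		for (i = 0; i < PG_SIZE / PTE_SIZE; i++)
			set_pmd(pmd + i, __pmd(((pfn + i) << PTE_SHIFT) |
					       _PAGE_TABLE));
	}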
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
2026-02-19 17:09 ` Kiryl Shutsemau
@ 2026-02-19 23:24 ` Kalesh Singh
2026-02-20 12:10 ` Kiryl Shutsemau
2 siblings, 1 reply; 33+ messages in thread
From: Kalesh Singh @ 2026-02-19 23:24 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif, android-mm,
Adrian Barnaś,
Mateusz Maćkowski, Steven Moreland
On Thu, Feb 19, 2026 at 7:39 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > [... full proposal snipped ...]
>
> When discussing per-process page sizes with Ryan and Dev, I mentioned
> that having a larger emulated page size could be interesting for other
> architectures as well.
>
> That is, we would emulate a 64K page size on Intel for user space as
> well, but let the OS work with 4K pages.
>
> We'd only allocate+map large folios into user space + pagecache, but
> still allow for page tables etc. to not waste memory.
>
> So "most" of your allocations in the system would actually be at least
> 64k, reducing zone lock contention etc.
>
>
> It doesn't solve all the problems you wanted to tackle on your list
> (e.g., "struct page" overhead, which will be sorted out by memdescs).
Hi Kiryl,
I'd be interested to discuss this at LSFMM.
On Android, we have a separate but related use case: we emulate the
userspace page size on x86, primarily to enable app developers to
conduct compatibility testing of their apps for 16KB Android devices.
[1]
It mainly works by enforcing a larger granularity on the VMAs to
emulate a userspace page size, somewhat similar to what David
mentioned, while the underlying kernel still operates on a 4KB
granularity. [2]
IIUC the current design would not enforce the larger granularity /
alignment for VMAs, to avoid breaking ABI. However, I'd be interested to
discuss whether it can be extended to cover this use case as well.
[1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator
[2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic
Thanks,
Kalesh
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 16:09 ` David Hildenbrand (Arm)
@ 2026-02-20 2:55 ` Zi Yan
0 siblings, 0 replies; 33+ messages in thread
From: Zi Yan @ 2026-02-20 2:55 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On 19 Feb 2026, at 11:09, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:54, Kiryl Shutsemau wrote:
>> On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/19/26 16:08, Kiryl Shutsemau wrote:
>>>> [... full proposal snipped ...]
>>>
>>> When discussing per-process page sizes with Ryan and Dev, I mentioned that
>>> having a larger emulated page size could be interesting for other
>>> architectures as well.
>>>
>>> That is, we would emulate a 64K page size on Intel for user space as well,
>>> but let the OS work with 4K pages.
>>>
>>> We'd only allocate+map large folios into user space + pagecache, but still
>>> allow for page tables etc. to not waste memory.
>>>
>>> So "most" of your allocations in the system would actually be at least 64k,
>>> reducing zone lock contention etc.
>>
>> I am not convinced emulation would help zone lock contention. I expect
>> contention to be higher if the page allocator sees a mix of 4k and 64k
>> requests. It sounds like constant split/merge under the lock.
>
> If most of your allocations are larger, then there isn't that much splitting/merging.
>
> There will be some for the < 64k allocations of course, but when all user space+page cache is >= 64k, then the split/merge + zone lock contention should be heavily reduced.
>
>>
>>> It doesn't solve all the problems you wanted to tackle on your list (e.g.,
>>> "struct page" overhead, which will be sorted out by memdescs).
>>
>> I don't think we can serve 1G pages out of the buddy allocator with 4k
>> order-0. And without it, I don't see how to get to viable 1G THPs.
>
> Zi Yan was the one working on this, and I think we had ideas on how to make that work in the long run.
Right. The idea is to add a super pageblock (or whatever name we pick), which consists of N
consecutive pageblocks, so that anti-fragmentation can work at a larger granularity, e.g., 1GB,
to create free pages. Whether 1GB free pages from memory compaction need to go into the buddy
allocator or not is debatable.
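For scale (a sketch; SUPER_PAGEBLOCK_ORDER is a made-up name):

	/* With 4k pages, a pageblock is order-9 (2M). Grouping 512 of
	 * them gives anti-fragmentation a 1G working unit: 2M * 512 = 1G. */
	#define SUPER_PAGEBLOCK_ORDER	(pageblock_order + 9)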
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 22:05 ` Kiryl Shutsemau
@ 2026-02-20 3:28 ` Liam R. Howlett
2026-02-20 12:33 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2026-02-20 3:28 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Dave Hansen, lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
* Kiryl Shutsemau <kas@kernel.org> [260219 17:05]:
> On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> > On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > ~1.6% of RAM to ~0.1%;
> > ...
> > But, it will mostly be getting better performance at the _cost_ of
> > consuming more RAM, not saving RAM.
>
> That's fair.
>
> The problem with struct page memory consumption is that it is static and
> cannot be reclaimed. You pay the struct page tax no matter what.
>
> Page cache rounding overhead can be large, but a motivated userspace can
> keep it under control by avoiding splitting a dataset into many small
> files. And this memory is reclaimable.
>
But we are in reclaim a lot more these days. As I'm sure you are aware,
we are trying to maximize the resources (both cpu and ram) of any
machine powered on. Entering reclaim will consume cpu time and affect
other tasks.

Especially on machines running multiple workloads, the tendency is to
have a primary workload in focus, with the less important work being
killed if necessary. Reducing the overhead just means more secondary
tasks, or a bigger footprint for the ones already running.

Increasing the memory pressure will degrade the primary workload more
frequently, even if we recover enough to avoid OOMing the secondary.

Whereas in the struct-page-tax world, the secondary task would be killed
after a shorter (and less frequently executed) reclaim comes up short.
So I would think that we would be degrading the primary workload in an
attempt to keep the secondary alive? Maybe I'm over-simplifying here?

Near the other end of the spectrum, we have chromebooks that are
constantly in reclaim, even with 4k pages. I guess these machines would
be destined to keep the same page size they use today. That is, this
solution to the struct page tax is only useful if you have a lot of
memory. But then again, that's where the bookkeeping costs become hard
to take.
Thanks,
Liam
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (5 preceding siblings ...)
2026-02-19 17:47 ` Matthew Wilcox
@ 2026-02-20 9:04 ` David Laight
2026-02-20 12:12 ` Kiryl Shutsemau
6 siblings, 1 reply; 33+ messages in thread
From: David Laight @ 2026-02-20 9:04 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, 19 Feb 2026 15:08:51 +0000
Kiryl Shutsemau <kas@kernel.org> wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
Also the 'random' buffers that are PAGE_SIZE rather than 4k.
I also wonder how it affects mmap of kernel memory and the alignment
of PCIe windows (etc).
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:09 ` Kiryl Shutsemau
@ 2026-02-20 10:24 ` David Hildenbrand (Arm)
2026-02-20 12:07 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-20 10:24 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
>> When discussing per-process page sizes with Ryan and Dev, I mentioned that
>> having a larger emulated page size could be interesting for other
>> architectures as well.
>>
>> That is, we would emulate a 64K page size on Intel for user space as well,
>> but let the OS work with 4K pages.
>
> Just to clarify, do you want it to be enforced on the userspace ABI?
> Like, all mappings are 64k aligned?
Right, see the proposal from Dev on the list.
From the user-space POV, the page size would be 64K for these emulated
processes. That is, VMAs must be suitably aligned, etc.
One key thing, I think, is that you could run such emulated-64k processes
(that actually support it!) alongside 4k processes on the same machine,
like Arm is considering.
You would have no weird "vma crosses base pages" handling, which is just
rather nasty and makes my head hurt.
>
>> We'd only allocate+map large folios into user space + pagecache, but still
>> allow for page tables etc. to not waste memory.
>
> Waste of memory for page tables is solvable and pretty straightforward.
> Most of such cases can be solved mechanically by switching to slab.
Well, yes, like Willy says, there are already similar custom solutions
for s390x and ppc.
Pasha talked recently about the memory waste of 16k kernel stacks and
how we would want to reduce that to 4k. In your proposal, it would be
64k, unless you somehow manage to allocate multiple kernel stacks from
the same 64k page. My head hurts thinking about whether that could work,
maybe it could (no idea about guard pages in there, though).
Let's take a look at the history of page size usage on Arm (people can
feel free to correct me):
(1) Most distros were using 64k on Arm.
(2) People realized that 64k was suboptimal for many use cases (memory
waste for stacks, pagecache, etc) and started to switch to 4k. I
remember that mostly HPC-centric users stuck to 64k, but there was
also demand from others to be able to stay on 64k.
(3) Arm improved performance on a 4k kernel by adding cont-pte support,
trying to get closer to 64k native performance.
(4) Achieving 64k native performance is hard, which is why per-process
page sizes are being explored to get the best out of both worlds
(use 64k page size only where it really matters for performance).
Arm clearly has the advantage of actually benefiting from hardware
support for 64k.
IIUC, what you are proposing feels a bit like traveling back in time
when it comes to the memory waste problem that Arm users encountered.
Where do you see the big difference to 64k on Arm in your proposal?
Would you currently also be running 64k Arm in production, and is the
memory waste etc acceptable?
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 10:24 ` David Hildenbrand (Arm)
@ 2026-02-20 12:07 ` Kiryl Shutsemau
2026-02-20 16:30 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:07 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
> > > When discussing per-process page sizes with Ryan and Dev, I mentioned that
> > > having a larger emulated page size could be interesting for other
> > > architectures as well.
> > >
> > > That is, we would emulate a 64K page size on Intel for user space as well,
> > > but let the OS work with 4K pages.
> >
> > Just to clarify, do you want it to be enforced on the userspace ABI?
> > Like, all mappings are 64k aligned?
>
> Right, see the proposal from Dev on the list.
>
> From the user-space POV, the page size would be 64K for these emulated processes.
> That is, VMAs must be suitably aligned, etc.
Well, it will drastically limit the adoption. We have too much legacy
stuff on x86.
> > > We'd only allocate+map large folios into user space + pagecache, but still
> > > allow for page tables etc. to not waste memory.
> >
> > Waste of memory for page tables is solvable and pretty straightforward.
> > Most of such cases can be solved mechanically by switching to slab.
>
> Well, yes, like Willy says, there are already similar custom solutions for
> s390x and ppc.
>
> Pasha talked recently about the memory waste of 16k kernel stacks and how we
> would want to reduce that to 4k. In your proposal, it would be 64k, unless
> you somehow manage to allocate multiple kernel stacks from the same 64k
> page. My head hurts thinking about whether that could work, maybe it could
> (no idea about guard pages in there, though).
Kernel stacks are allocated from vmalloc. I think mapping them with
sub-page granularity should be doable.
BTW, do you see any reason why a slab-allocated stack wouldn't work for
large base page sizes? There's no requirement for it to be aligned to a
page or PTE, right?
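Something like this is what I have in mind (just a sketch; the obvious
cost vs. vmalloc-backed stacks is losing the guard pages around each
stack):

	/* A slab cache of THREAD_SIZE-d, THREAD_SIZE-aligned stacks;
	 * no page or PTE alignment assumed anywhere. */
	struct kmem_cache *cache = kmem_cache_create("kernel_stack",
						     THREAD_SIZE, THREAD_SIZE,
						     0, NULL);
	void *stack = kmem_cache_alloc(cache, GFP_KERNEL);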
> [...]
>
> IIUC, what you are proposing feels a bit like traveling back in time when it
> comes to the memory waste problem that Arm users encountered.
>
> Where do you see the big difference to 64k on Arm in your proposal? Would
> you currently also be running 64k Arm in production, and is the memory
> waste etc acceptable?
That's the point. I don't see a big difference to 64k Arm. I want to
bring this option to x86: at some machine size, it makes sense to trade
memory consumption for scalability. I am targeting it at machines with
over 2TiB of RAM.

BTW, we do run 64k Arm in our fleet. There are some growing pains, but it
looks good in general. We have no plans to switch to 4k (or 16k) at the
moment. 512M THPs also look good on some workloads.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 23:24 ` Kalesh Singh
@ 2026-02-20 12:10 ` Kiryl Shutsemau
2026-02-20 19:21 ` Kalesh Singh
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:10 UTC (permalink / raw)
To: Kalesh Singh
Cc: David Hildenbrand (Arm),
lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif, android-mm, Adrian Barnaś,
Mateusz Maćkowski, Steven Moreland
On Thu, Feb 19, 2026 at 03:24:37PM -0800, Kalesh Singh wrote:
> On Thu, Feb 19, 2026 at 7:39 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > > [... full proposal snipped ...]
> >
> > When discussing per-process page sizes with Ryan and Dev, I mentioned
> > that having a larger emulated page size could be interesting for other
> > architectures as well.
> >
> > That is, we would emulate a 64K page size on Intel for user space as
> > well, but let the OS work with 4K pages.
> >
> > We'd only allocate+map large folios into user space + pagecache, but
> > still allow for page tables etc. to not waste memory.
> >
> > So "most" of your allocations in the system would actually be at least
> > 64k, reducing zone lock contention etc.
> >
> >
> > It doesn't solve all the problems you wanted to tackle on your list
> > (e.g., "struct page" overhead, which will be sorted out by memdescs).
>
> Hi Kiryl,
>
> I'd be interested to discuss this at LSFMM.
>
> On Android, we have a separate but related use case: we emulate the
> userspace page size on x86, primarily to enable app developers to
> conduct compatibility testing of their apps for 16KB Android devices.
> [1]
>
> It mainly works by enforcing a larger granularity on the VMAs to
> emulate a userspace page size, somewhat similar to what David
> mentioned, while the underlying kernel still operates on a 4KB
> granularity. [2]
>
> IIUC the current design would not enfore the larger granularity /
> alignment for VMAs to avoid breaking ABI. However, I'd be interest to
> discuss whether it can be extended to cover this usecase as well.
I don't want to break ABI, but might add a knob (maybe personality(2) ?)
for enforcement to see what breaks.
In general, I would prefer to advertise a new value to userspace that
would mean preferred virtual address space granularity.
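Something along these lines, say (AT_PREF_PAGESZ is a made-up name for a
hypothetical new auxv entry):

	#include <sys/auxv.h>
	#include <unistd.h>

	long pte_size = sysconf(_SC_PAGESIZE);	  /* mapping granularity */
	long pg_size = getauxval(AT_PREF_PAGESZ); /* preferred granularity */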
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 9:04 ` David Laight
@ 2026-02-20 12:12 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:12 UTC (permalink / raw)
To: David Laight
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 09:04:09AM +0000, David Laight wrote:
> On Thu, 19 Feb 2026 15:08:51 +0000
> Kiryl Shutsemau <kas@kernel.org> wrote:
>
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
>
> Also the 'random' buffers that are PAGE_SIZE rather than 4k.
Yeah, in some places we use PAGE_SIZE just because, without any real reason.
> I also wonder how is affects mmap of kernel memory and the alignement
> of PCIe windows (etc).
The kernel, like userspace, is free to map memory at PTE_SIZE granularity.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 3:28 ` Liam R. Howlett
@ 2026-02-20 12:33 ` Kiryl Shutsemau
2026-02-20 15:17 ` Liam R. Howlett
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:33 UTC (permalink / raw)
To: Liam R. Howlett, Dave Hansen, lsf-pc, linux-mm, x86,
linux-kernel, Andrew Morton, David Hildenbrand, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Lorenzo Stoakes,
Mike Rapoport, Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 10:28:20PM -0500, Liam R. Howlett wrote:
> * Kiryl Shutsemau <kas@kernel.org> [260219 17:05]:
> > On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> > > On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > > ~1.6% of RAM to ~0.1%;
> > > ...
> > > But, it will mostly be getting better performance at the _cost_ of
> > > consuming more RAM, not saving RAM.
> >
> > That's fair.
> >
> > The problem with struct page memory consumption is that it is static and
> > cannot be reclaimed. You pay the struct page tax no matter what.
> >
> > Page cache rounding overhead can be large, but a motivated userspace can
> > keep it under control by avoiding splitting a dataset into many small
> > files. And this memory is reclaimable.
> >
>
> But we are in reclaim a lot more these days. As I'm sure you are aware,
> we are trying to maximize the resources (both cpu and ram) of any
> machine powered on. Entering reclaim will consume cpu time and affect
> other tasks.
>
> Especially on machines running multiple workloads, the tendency is to
> have a primary workload in focus, with the less important work being
> killed if necessary. Reducing the overhead just means more secondary
> tasks, or a bigger footprint for the ones already running.
>
> Increasing the memory pressure will degrade the primary workload more
> frequently, even if we recover enough to avoid OOMing the secondary.
>
> Whereas in the struct-page-tax world, the secondary task would be killed
> after a shorter (and less frequently executed) reclaim comes up short.
> So I would think that we would be degrading the primary workload in an
> attempt to keep the secondary alive? Maybe I'm over-simplifying here?
I am not sure I fully follow your point.
Sizing tasks and scheduling tasks between machines is hard in general.
I don't think the balance between struct page tax and page cache
rounding overhead is going to be the primary factor.
> Near the other end of the spectrum, we have chromebooks that are
> constantly in reclaim, even with 4k pages. I guess these machines would
> be destined to keep the same page size they use today. That is, this
> solution to the struct page tax is only useful if you have a lot of
> memory. But then again, that's where the bookkeeping costs become hard
> to take.
Smaller machines are not the target for 64k pages. They will not benefit
from them.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 12:33 ` Kiryl Shutsemau
@ 2026-02-20 15:17 ` Liam R. Howlett
2026-02-20 15:50 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2026-02-20 15:17 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Dave Hansen, lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
* Kiryl Shutsemau <kas@kernel.org> [260220 07:33]:
> On Thu, Feb 19, 2026 at 10:28:20PM -0500, Liam R. Howlett wrote:
> > * Kiryl Shutsemau <kas@kernel.org> [260219 17:05]:
> > > On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> > > > On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > > > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > > > ~1.6% of RAM to ~0.1%;
> > > > ...
> > > > But, it will mostly be getting better performance at the _cost_ of
> > > > consuming more RAM, not saving RAM.
> > >
> > > That's fair.
> > >
> > > The problem with struct page memory consumption is that it is static and
> > > cannot be reclaimed. You pay the struct page tax no matter what.
> > >
> > > Page cache rounding overhead can be large, but a motivated userspace can
> > > keep it under control by avoiding splitting a dataset into many small
> > > files. And this memory is reclaimable.
> > >
> >
> > But we are in reclaim a lot more these days. As I'm sure you are aware,
> > we are trying to maximize the resources (both cpu and ram) of any
> > machine powered on. Entering reclaim will consume cpu time and affect
> > other tasks.
> >
> > Especially on machines running multiple workloads, the tendency is to
> > have a primary workload in focus, with the less important work being
> > killed if necessary. Reducing the overhead just means more secondary
> > tasks, or a bigger footprint for the ones already running.
> >
> > Increasing the memory pressure will degrade the primary workload more
> > frequently, even if we recover enough to avoid OOMing the secondary.
> >
> > Whereas in the struct-page-tax world, the secondary task would be killed
> > after a shorter (and less frequently executed) reclaim comes up short.
> > So I would think that we would be degrading the primary workload in an
> > attempt to keep the secondary alive? Maybe I'm over-simplifying here?
>
> I am not sure I fully follow your point.
>
> Sizing tasks and scheduling tasks between machines is hard in general.
> I don't think the balance between struct page tax and page cache
> rounding overhead is going to be the primary factor.
I think there are more trade-offs than what you listed. It's still
probably worth doing, but I wanted to know if you thought that this would
cause us to spend more time in reclaim, which seems to be implied above.
So, another trade-off might be all the reclaim penalty being paid more
frequently?
...
Thanks,
Liam
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 15:17 ` Liam R. Howlett
@ 2026-02-20 15:50 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 15:50 UTC (permalink / raw)
To: Liam R. Howlett, Dave Hansen, lsf-pc, linux-mm, x86,
linux-kernel, Andrew Morton, David Hildenbrand, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Lorenzo Stoakes,
Mike Rapoport, Matthew Wilcox, Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 10:17:45AM -0500, Liam R. Howlett wrote:
> * Kiryl Shutsemau <kas@kernel.org> [260220 07:33]:
> > On Thu, Feb 19, 2026 at 10:28:20PM -0500, Liam R. Howlett wrote:
> > > [...]
> >
> > I am not sure I fully follow your point.
> >
> > Sizing tasks and scheduling tasks between machines is hard in general.
> > I don't think the balance between struct page tax and page cache
> > rounding overhead is going to be the primary factor.
>
> I think there are more trade-offs than what you listed. It's still
> probably worth doing, but I wanted to know if you thought that this would
> cause us to spend more time in reclaim, which seems to be implied above.
> So, another trade-off might be all the reclaim penalty being paid more
> frequently?
I am not sure.

The kernel would need to do less work in reclaim per unit of memory.
Depending on the workload, you might see fewer allocation events and
therefore less frequent reclaim.

It's all too hand-wavy at this stage.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 12:07 ` Kiryl Shutsemau
@ 2026-02-20 16:30 ` David Hildenbrand (Arm)
2026-02-20 19:33 ` Kalesh Singh
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-20 16:30 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/20/26 13:07, Kiryl Shutsemau wrote:
> On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> Just to clarify, do you want it to be enforced on the userspace ABI?
>>> Like, all mappings are 64k aligned?
>>
>> Right, see the proposal from Dev on the list.
>>
>> From the user-space POV, the page size would be 64K for these emulated processes.
>> That is, VMAs must be suitably aligned, etc.
>
> Well, it will drastically limit the adoption. We have too much legacy
> stuff on x86.
I'd assume that many applications nowadays can deal with differing page
sizes (thanks to some other architectures paving the way).
But yes, some real legacy stuff, or stuff that only ever cared about
Intel, still hardcodes pagesize=4k.
In Meta's fleet, it would be quite interesting to see how much
conversion would have to be done.
For legacy apps, you could still run them as 4k pagesize on the same
system, of course.
>
>>>
>>> Waste of memory for page tables is solvable and pretty straightforward.
>>> Most of such cases can be solved mechanically by switching to slab.
>>
>> Well, yes, like Willy says, there are already similar custom solutions for
>> s390x and ppc.
>>
>> Pasha talked recently about the memory waste of 16k kernel stacks and how we
>> would want to reduce that to 4k. In your proposal, it would be 64k, unless
>> you somehow manage to allocate multiple kernel stacks from the same 64k
>> page. My head hurts thinking about whether that could work, maybe it could
>> (no idea about guard pages in there, though).
>
> Kernel stacks are allocated from vmalloc. I think mapping them with
> sub-page granularity should be doable.
I still have to wrap my head around the sub-page mapping here as well.
It's scary.
Re mapcount: I think if any part of the page is mapped, it would be
considered mapped -> mapcount += 1.
>
> BTW, do you see any reason why a slab-allocated stack wouldn't work for
> large base page sizes? There's no requirement for it to be aligned to a
> page or PTE, right?
I'd assume that would work. The devil is in the details with these
things before we have memdescs.
E.g., page tables have a dedicated type (PGTY_table) and store separate
metadata in the ptdesc. For kernel stacks there was once a proposal to
have a type, but it is not upstream.
>
>> [...]
>>
>> IIUC, what you are proposing feels a bit like traveling back in time when it
>> comes to the memory waste problem that Arm users encountered.
>>
>> Where do you see the big difference to 64k on Arm in your proposal? Would
>> you currently also be running 64k Arm in production, and is the memory
>> waste etc acceptable?
>
> That's the point. I don't see a big difference to 64k Arm. I want to
> bring this option to x86: at some machine size, it makes sense to trade
> memory consumption for scalability. I am targeting it at machines with
> over 2TiB of RAM.
>
> BTW, we do run 64k Arm in our fleet. There are some growing pains, but it
> looks good in general. We have no plans to switch to 4k (or 16k) at the
> moment. 512M THPs also look good on some workloads.
Okay, that's valuable information, thanks!
Being able to remove the sub-page mapping part (or being able to just
hide it somewhere deep down in arch code) would make this a lot easier
to digest.
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 12:10 ` Kiryl Shutsemau
@ 2026-02-20 19:21 ` Kalesh Singh
0 siblings, 0 replies; 33+ messages in thread
From: Kalesh Singh @ 2026-02-20 19:21 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: David Hildenbrand (Arm),
lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif, android-mm, Adrian Barnaś,
Mateusz Maćkowski, Steven Moreland
On Fri, Feb 20, 2026 at 4:10 AM Kiryl Shutsemau <kas@kernel.org> wrote:
>
> On Thu, Feb 19, 2026 at 03:24:37PM -0800, Kalesh Singh wrote:
> > On Thu, Feb 19, 2026 at 7:39 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> > >
> > > On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > > > [... full proposal snipped ...]
> > >
> > > When discussing per-process page sizes with Ryan and Dev, I mentioned
> > > that having a larger emulated page size could be interesting for other
> > > architectures as well.
> > >
> > > That is, we would emulate a 64K page size on Intel for user space as
> > > well, but let the OS work with 4K pages.
> > >
> > > We'd only allocate+map large folios into user space + pagecache, but
> > > still allow for page tables etc. to not waste memory.
> > >
> > > So "most" of your allocations in the system would actually be at least
> > > 64k, reducing zone lock contention etc.
> > >
> > >
> > > It doesn't solve all the problems you wanted to tackle on your list
> > > (e.g., "struct page" overhead, which will be sorted out by memdescs).
> >
> > Hi Kiryl,
> >
> > I'd be interested to discuss this at LSFMM.
> >
> > On Android, we have a separate but related use case: we emulate the
> > userspace page size on x86, primarily to enable app developers to
> > conduct compatibility testing of their apps for 16KB Android devices.
> > [1]
> >
> > It mainly works by enforcing a larger granularity on the VMAs to
> > emulate a userspace page size, somewhat similar to what David
> > mentioned, while the underlying kernel still operates on a 4KB
> > granularity. [2]
> >
> > IIUC the current design would not enfore the larger granularity /
> > alignment for VMAs to avoid breaking ABI. However, I'd be interest to
> > discuss whether it can be extended to cover this usecase as well.
>
> I don't want to break ABI, but might add a knob (maybe personality(2) ?)
> for enforcement to see what breaks.
I think personality(2) may be too late? By the time a process invokes
it, the initial userspace mappings (executable, linker for init, etc)
are already established with the default granularity.
To handle this, I've been using an early_param to enforce the larger
VMA alignment system-wide right from boot.
Perhaps something for global enforcement (Kconfig/early_param) and a
prctl/personality flag for per-process opt-in?
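A rough sketch of the boot-time side (names made up):

	static unsigned int emulated_page_shift __ro_after_init = PAGE_SHIFT;

	static int __init parse_page_shift(char *str)
	{
		unsigned int shift;

		if (kstrtouint(str, 0, &shift))
			return -EINVAL;
		/* e.g. 14 for 16k, 16 for 64k */
		if (shift >= PAGE_SHIFT && shift <= 16)
			emulated_page_shift = shift;
		return 0;
	}
	early_param("page_shift", parse_page_shift);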
>
> In general, I would prefer to advertise a new value to userspace that
> would mean preferred virtual address space granularity.
This makes sense for maintaining ABI compatibility. Userspace
allocators might want to optimize their layouts to match PG_SIZE while
still being able to operate at PTE_SIZE when needed.
-- Kalesh
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 16:30 ` David Hildenbrand (Arm)
@ 2026-02-20 19:33 ` Kalesh Singh
0 siblings, 0 replies; 33+ messages in thread
From: Kalesh Singh @ 2026-02-20 19:33 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 8:30 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/20/26 13:07, Kiryl Shutsemau wrote:
> > On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
> >> [...]
> >
> > Well, it will drastically limit the adoption. We have too much legacy
> > stuff on x86.
>
> I'd assume that many applications nowadays can deal with differing page
> sizes (thanks to some other architectures paving the way).
>
> But yes, some real legacy stuff, or stuff that only ever cared about
> Intel, still hardcodes pagesize=4k.
I think most issues will stem from linkers setting the default ELF
segment alignment (max-page-size) for x86 to 4096, so those ELFs will
not load correctly, or at all, with a larger emulated granularity.
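(FWIW, ELFs built with something like "-Wl,-z,max-page-size=0x10000"
should remain loadable at up to 64k granularity.)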
-- Kalesh
>
> In Meta's fleet, I'd be quite interesting how much conversion there
> would have to be done.
>
> For legacy apps, you could still run them as 4k pagesize on the same
> system, of course.
>
> >
> >>>
> >>> Waste of memory for page table is solvable and pretty straight forward.
> >>> Most of such cases can be solve mechanically by switching to slab.
> >>
> >> Well, yes, like Willy says, there are already similar custom solutions for
> >> s390x and ppc.
> >>
> >> Pasha talked recently about the memory waste of 16k kernel stacks and how we
> >> would want to reduce that to 4k. In your proposal, it would be 64k, unless
> >> you somehow manage to allocate multiple kernel stacks from the same 64k
> >> page. My head hurts thinking about whether that could work, maybe it could
> >> (no idea about guard pages in there, though).
> >
> > Kernel stack is allocated from vmalloc. I think mapping them with
> > sub-page granularity should be doable.
>
> I still have to wrap my head around the sub-page mapping here as well.
> It's scary.
>
> Re mapcount: I think if any part of the page is mapped, it would be
> considered mapped -> mapcount += 1.
>
> >
> > BTW, do you see any reason why slab-allocated stack wouldn't work for
> > large base page sizes? There's no requirement for it be aligned to page
> > or PTE, right?
>
> I'd assume that would work. The devil is in the details with these
> things before we have memdescs.
>
> E.g., page tables have a dedicated type (PGTY_table) and store separate
> metadata in the ptdesc. For kernel stacks there was once a proposal to
> have a type, but it is not upstream.
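Not an existing implementation, but a minimal sketch of what
slab-backed stacks might look like with the current slab API (objects
kept THREAD_SIZE-aligned for packing, so that, e.g., four 16k stacks
share one 64k page; vmalloc guard pages are lost, as noted earlier in
the thread):

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/thread_info.h>

static struct kmem_cache *stack_cache;

static int __init stack_cache_init(void)
{
	/* Several THREAD_SIZE stacks can share one large order-0 page. */
	stack_cache = kmem_cache_create("kernel_stack", THREAD_SIZE,
					THREAD_SIZE, SLAB_PANIC, NULL);
	return 0;
}

static void *alloc_task_stack(int node)
{
	return kmem_cache_alloc_node(stack_cache, GFP_KERNEL, node);
}

static void free_task_stack(void *stack)
{
	kmem_cache_free(stack_cache, stack);
}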
>
> >
> >> Let's take a look at the history of page size usage on Arm (people can feel
> >> free to correct me):
> >>
> >> (1) Most distros were using 64k on Arm.
> >>
> >> (2) People realized that 64k was suboptimal for many use cases (memory
> >> waste for stacks, pagecache, etc.) and started to switch to 4k. I
> >> remember that mostly HPC-centric users stuck to 64k, but there was
> >> also demand from others to be able to stay on 64k.
> >>
> >> (3) Arm improved performance on a 4k kernel by adding cont-pte support,
> >> trying to get closer to 64k native performance.
> >>
> >> (4) Achieving 64k native performance is hard, which is why per-process
> >> page sizes are being explored to get the best out of both worlds
> >> (use 64k page size only where it really matters for performance).
> >>
> >> Arm clearly has the advantage of actually benefiting from hardware
> >> support for 64k.
> >>
> >> IIUC, what you are proposing feels a bit like traveling back in time when it
> >> comes to the memory waste problem that Arm users encountered.
> >>
> >> Where do you see the big difference to 64k on Arm in your proposal? Would
> >> you currently also be running 64k Arm in production and the memory waste etc
> >> is acceptable?
> >
> > That's the point. I don't see a big difference to 64k Arm. I want to
> > bring this option to x86: at some machine size it makes sense to trade
> > memory consumption for scalability. I am targeting machines with over
> > 2TiB of RAM.
> >
> > BTW, we do run 64k Arm in our fleet. There are some growing pains, but
> > it looks good in general. We have no plans to switch to 4k (or 16k) at
> > the moment. 512M THPs also look good on some workloads.
>
> Okay, that's valuable information, thanks!
>
> Being able to remove the sub-page mapping part (or being able to just
> hide it somewhere deep down in arch code) would make this a lot easier
> to digest.
>
> --
> Cheers,
>
> David
>
end of thread (newest: 2026-02-20 19:33 UTC)
Thread overview: 33+ messages
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:20 ` Peter Zijlstra
2026-02-19 15:27 ` Kiryl Shutsemau
2026-02-19 15:33 ` Pedro Falcato
2026-02-19 15:50 ` Kiryl Shutsemau
2026-02-19 15:53 ` David Hildenbrand (Arm)
2026-02-19 19:31 ` Pedro Falcato
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
2026-02-19 16:09 ` David Hildenbrand (Arm)
2026-02-20 2:55 ` Zi Yan
2026-02-19 17:09 ` Kiryl Shutsemau
2026-02-20 10:24 ` David Hildenbrand (Arm)
2026-02-20 12:07 ` Kiryl Shutsemau
2026-02-20 16:30 ` David Hildenbrand (Arm)
2026-02-20 19:33 ` Kalesh Singh
2026-02-19 23:24 ` Kalesh Singh
2026-02-20 12:10 ` Kiryl Shutsemau
2026-02-20 19:21 ` Kalesh Singh
2026-02-19 17:08 ` Dave Hansen
2026-02-19 22:05 ` Kiryl Shutsemau
2026-02-20 3:28 ` Liam R. Howlett
2026-02-20 12:33 ` Kiryl Shutsemau
2026-02-20 15:17 ` Liam R. Howlett
2026-02-20 15:50 ` Kiryl Shutsemau
2026-02-19 17:30 ` Dave Hansen
2026-02-19 22:14 ` Kiryl Shutsemau
2026-02-19 22:21 ` Dave Hansen
2026-02-19 17:47 ` Matthew Wilcox
2026-02-19 22:26 ` Kiryl Shutsemau
2026-02-20 9:04 ` David Laight
2026-02-20 12:12 ` Kiryl Shutsemau