* [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
@ 2026-02-19 15:08 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
` (6 more replies)
0 siblings, 7 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:08 UTC (permalink / raw)
To: lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, David Hildenbrand,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
No, there's no new hardware (that I know of). I want to explore what page size
means.
The kernel uses the same value - PAGE_SIZE - for two things:
- the order-0 buddy allocation size;
- the granularity of virtual address space mapping;
I think we can benefit from separating these two meanings and allowing
order-0 allocations to be larger than the virtual address space covered by a
PTE entry.
The main motivation is scalability. Managing memory on multi-terabyte
machines in 4k is suboptimal, to say the least.
Potential benefits of the approach (assuming 64k pages):
- The order-0 page size cuts struct page overhead by a factor of 16. From
~1.6% of RAM to ~0.1%;
- TLB wins on machines with TLB coalescing as long as mapping is naturally
aligned;
- Order-5 allocation is 2M, resulting in less pressure on the zone lock;
- 1G pages are within possibility for the buddy allocator - order-14
allocation. It can open the road to 1G THPs.
- As with THP, fewer pages - less pressure on the LRU lock;
- ...
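For reference, the arithmetic behind these numbers (assuming 64 bytes of
struct page per order-0 page):

  64 / 4096  = ~1.6% of RAM;  64 / 65536 = ~0.1%
  64k << 5   = 2M  (order-5)
  64k << 14  = 1G  (order-14)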
The trade-off is memory waste (similar to what we have on architectures with
native 64k pages today) and complexity, mostly in the core-MM code.
== Design considerations ==
I want to split PAGE_SIZE into two distinct values:
- PTE_SIZE defines the virtual address space granularity;
- PG_SIZE defines the size of the order-0 buddy allocation;
PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
requires conversion, and keep existing code working while conversion is in
progress.
The same split happens for other page-related macros: mask, shift,
alignment helpers, etc.
PFNs are in PTE_SIZE units.
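Roughly, the split could look like this. A sketch for the 64k
configuration; apart from PTE_SIZE/PG_SIZE, the names and values below are
illustrative, not what is in the tree:

	#define PTE_SHIFT	12	/* 4k: virtual address space granularity */
	#define PG_SHIFT	16	/* 64k: order-0 buddy allocation */

	#define PTE_SIZE	(_AC(1, UL) << PTE_SHIFT)
	#define PG_SIZE		(_AC(1, UL) << PG_SHIFT)
	#define PTE_MASK	(~(PTE_SIZE - 1))
	#define PG_MASK		(~(PG_SIZE - 1))

	/* Unconverted code keeps compiling only while the sizes match: */
	#if PTE_SHIFT == PG_SHIFT
	#define PAGE_SHIFT	PTE_SHIFT
	#define PAGE_SIZE	PTE_SIZE
	#endif

	/* PFNs stay in PTE_SIZE units: pfn == paddr >> PTE_SHIFT */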
The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
units.
Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
for userspace. But we might want to communicate PG_SIZE to userspace to
get the optimal results for userspace that cares.
PTE_SIZE granularity requires a substantial rework of page fault and VMA
handling:
- A struct page pointer and pgprot_t are not enough to create a PTE entry.
We also need the offset within the page we are creating the PTE for.
- Since the VMA start can be aligned arbitrarily with respect to the
underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
which is in PTE_SIZE units.
- The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
misaligned cases;
Page faults into file mappings are relatively simple to handle as we
always have the page cache to refer to. So you can map only the part of the
page that fits in the page table, similarly to fault-around.
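To make that concrete, here is a rough sketch of mapping a single PTE_SIZE
slice of a larger page on fault, assuming the page is naturally aligned in
the file as page cache pages are. vm_pteoff is the field proposed above;
map_page_slice and PTES_PER_PG (PG_SIZE / PTE_SIZE) are hypothetical names
for illustration:

	static vm_fault_t map_page_slice(struct vm_fault *vmf, struct page *page)
	{
		struct vm_area_struct *vma = vmf->vma;
		/* PTE-granular offset of the faulting address in the file... */
		unsigned long pteoff = vma->vm_pteoff +
			((vmf->address - vma->vm_start) >> PTE_SHIFT);
		/* ...selects which PTE_SIZE slice of the page to map: */
		pte_t entry = pfn_pte(page_to_pfn(page) + pteoff % PTES_PER_PG,
				      vma->vm_page_prot);

		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
		return 0;
	}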
Anonymous and file-CoW faults should also be simple as long as the VMA is
aligned to PG_SIZE in both the virtual address space and with respect to
vm_pgoff. We might waste some memory on the ends of the VMA, but it is
tolerable.
Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
pages across a page table boundary. In the worst case, a page is mapped across
a PGD entry boundary and PTEs for the page have to be put in two separate
subtrees of page tables.
A naive implementation would map different pages on different sides of a
page table boundary and accept the waste of one page per page table crossing.
The hope is that misaligned mappings are rare, but this is suboptimal.
mremap(2) is the ultimate stress test for the design.
On x86, page tables are allocated from the buddy allocator and if PG_SIZE
is greater than 4 KB, we need a way to pack multiple page tables into a
single page. We could use the slab allocator for this, but it would
require relocating the page-table metadata out of struct page.
Things I have not thought much about yet:
- Accounting for wasted memory;
- rmap;
- mapcount;
- A lot of arch-specific code;
- <insert my blind spot here>;
== Status ==
I have a POC implementation on top of v6.17:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git pte_size
It is WIP and full of hacks I am trying to make sense of now.
It compiles with my minimalistic kernel config and can boot to a shell with
both 16k and 64k base page sizes. The shell doesn't crash immediately, but
sometimes I wonder why :P
The patchset is large:
378 files changed, 3348 insertions(+), 3102 deletions(-)
and it is far from being complete.
== Goals ==
I want to get feedback for the overall design and possible ways to
upstream.
My plan is to submit an RFC-quality patchset before the summit.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
@ 2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:20 ` Peter Zijlstra
2026-02-19 15:33 ` Pedro Falcato
` (5 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:17 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
>
> I think we can benefit from separating these two meanings and allowing
> order-0 allocations to be larger than the virtual address space covered by a
> PTE entry.
Didn't AA do this a decade ago or somesuch?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:17 ` Peter Zijlstra
@ 2026-02-19 15:20 ` Peter Zijlstra
2026-02-19 15:27 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:20 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:17:29PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
>
> Didn't AA do this a decade ago or somesuch?
https://lwn.net/Articles/240914/
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:20 ` Peter Zijlstra
@ 2026-02-19 15:27 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:20:45PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 04:17:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > > No, there's no new hardware (that I know of). I want to explore what page size
> > > means.
> > >
> > > The kernel uses the same value - PAGE_SIZE - for two things:
> > >
> > > - the order-0 buddy allocation size;
> > >
> > > - the granularity of virtual address space mapping;
> > >
> > > I think we can benefit from separating these two meanings and allowing
> > > order-0 allocations to be larger than the virtual address space covered by a
> > > PTE entry.
> >
> > Didn't AA do this a decade ago or somesuch?
>
> https://lwn.net/Articles/240914/
Oh, 2007. It predates my time in the kernel. Will read up. Thanks!
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
@ 2026-02-19 15:33 ` Pedro Falcato
2026-02-19 15:50 ` Kiryl Shutsemau
2026-02-19 15:39 ` David Hildenbrand (Arm)
` (4 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Pedro Falcato @ 2026-02-19 15:33 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
>
> I think we can benefit from separating these two meanings and allowing
> order-0 allocations to be larger than the virtual address space covered by a
> PTE entry.
>
Doesn't this idea make less sense these days, with mTHP? You can get 64k
allocations simply by toggling one of the per-size entries under
/sys/kernel/mm/transparent_hugepage.
> The main motivation is scalability. Managing memory on multi-terabyte
> machines in 4k is suboptimal, to say the least.
>
> Potential benefits of the approach (assuming 64k pages):
>
> - The order-0 page size cuts struct page overhead by a factor of 16. From
> ~1.6% of RAM to ~0.1%;
>
> - TLB wins on machines with TLB coalescing as long as mapping is naturally
> aligned;
>
> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>
> - 1G pages are within possibility for the buddy allocator - order-14
> allocation. It can open the road to 1G THPs.
>
> - As with THP, fewer pages - less pressure on the LRU lock;
We could perhaps add a way to enforce a min_order globally on the page cache
to address that.
There are some points there which aren't addressed by mTHP work in any way
(1G THPs for one), others which are being addressed separately (memdesc work
trying to cut down on struct page overhead).
(I also don't understand your point about order-5 allocation, AFAIK pcp will
cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
full picture)
--
Pedro
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:33 ` Pedro Falcato
@ 2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
` (2 more replies)
2026-02-19 17:08 ` Dave Hansen
` (3 subsequent siblings)
6 siblings, 3 replies; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 15:39 UTC (permalink / raw)
To: Kiryl Shutsemau, lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Matthew Wilcox, Johannes Weiner, Usama Arif
On 2/19/26 16:08, Kiryl Shutsemau wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
>
> I think we can benefit from separating these two meanings and allowing
> order-0 allocations to be larger than the virtual address space covered by a
> PTE entry.
>
> The main motivation is scalability. Managing memory on multi-terabyte
> machines in 4k is suboptimal, to say the least.
>
> Potential benefits of the approach (assuming 64k pages):
>
> - The order-0 page size cuts struct page overhead by a factor of 16. From
> ~1.6% of RAM to ~0.1%;
>
> - TLB wins on machines with TLB coalescing as long as mapping is naturally
> aligned;
>
> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>
> - 1G pages are within possibility for the buddy allocator - order-14
> allocation. It can open the road to 1G THPs.
>
> - As with THP, fewer pages - less pressure on the LRU lock;
>
> - ...
>
> The trade-off is memory waste (similar to what we have on architectures with
> native 64k pages today) and complexity, mostly in the core-MM code.
>
> == Design considerations ==
>
> I want to split PAGE_SIZE into two distinct values:
>
> - PTE_SIZE defines the virtual address space granularity;
>
> - PG_SIZE defines the size of the order-0 buddy allocation;
>
> PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
> requires conversion, and keep existing code working while conversion is in
> progress.
>
> The same split happens for other page-related macros: mask, shift,
> alignment helpers, etc.
>
> PFNs are in PTE_SIZE units.
>
> The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
> units.
>
> Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
> for userspace. But we might want to communicate PG_SIZE to userspace to
> get the optimal results for userspace that cares.
>
> PTE_SIZE granularity requires a substantial rework of page fault and VMA
> handling:
>
> - A struct page pointer and pgprot_t are not enough to create a PTE entry.
> We also need the offset within the page we are creating the PTE for.
>
> - Since the VMA start can be aligned arbitrarily with respect to the
> underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
> which is in PTE_SIZE units.
>
> - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
> misaligned cases;
>
> Page faults into file mappings are relatively simple to handle as we
> always have the page cache to refer to. So you can map only the part of the
> page that fits in the page table, similarly to fault-around.
>
> Anonymous and file-CoW faults should also be simple as long as the VMA is
> aligned to PG_SIZE in both the virtual address space and with respect to
> vm_pgoff. We might waste some memory on the ends of the VMA, but it is
> tolerable.
>
> Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
> pages across a page table boundary. In the worst case, a page is mapped across
> a PGD entry boundary and PTEs for the page have to be put in two separate
> subtrees of page tables.
>
> A naive implementation would map different pages on different sides of a
> page table boundary and accept the waste of one page per page table crossing.
> The hope is that misaligned mappings are rare, but this is suboptimal.
>
> mremap(2) is the ultimate stress test for the design.
>
> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> is greater than 4 KB, we need a way to pack multiple page tables into a
> single page. We could use the slab allocator for this, but it would
> require relocating the page-table metadata out of struct page.
When discussing per-process page sizes with Ryan and Dev, I mentioned
that having a larger emulated page size could be interesting for other
architectures as well.
That is, we would emulate a 64K page size on Intel for user space as
well, but let the OS work with 4K pages.
We'd only allocate+map large folios into user space + pagecache, but
still allow for page tables etc. to not waste memory.
So "most" of your allocations in the system would actually be at least
64k, reducing zone lock contention etc.
It doesn't solve all the problems you wanted to tackle on your list
(e.g., "struct page" overhead, which will be sorted out by memdescs).
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:33 ` Pedro Falcato
@ 2026-02-19 15:50 ` Kiryl Shutsemau
2026-02-19 15:53 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:50 UTC (permalink / raw)
To: Pedro Falcato
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
> >
>
> Doesn't this idea make less sense these days, with mTHP? Simply by toggling one
> of the entries in /sys/kernel/mm/transparent_hugepage.
mTHP is still best effort. With this approach you don't need to care about
fragmentation: you will get your 64k page as long as you have free
memory.
> > The main motivation is scalability. Managing memory on multi-terabyte
> > machines in 4k is suboptimal, to say the least.
> >
> > Potential benefits of the approach (assuming 64k pages):
> >
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> >
> > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > aligned;
> >
> > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> >
> > - 1G pages are within possibility for the buddy allocator - order-14
> > allocation. It can open the road to 1G THPs.
> >
> > - As with THP, fewer pages - less pressure on the LRU lock;
>
> We could perhaps add a way to enforce a min_order globally on the page cache,
> as a way to address it.
Raising min_order is not free. It puts more pressure on the page allocator.
> There are some points there which aren't addressed by mTHP work in any way
> (1G THPs for one), others which are being addressed separately (memdesc work
> trying to cut down on struct page overhead).
>
> (I also don't understand your point about order-5 allocation, AFAIK pcp will
> cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
> full picture)
With a higher base page size, the page allocator doesn't need to do as much
work to merge/split buddy pages. So serving the same 2M as an order-5
allocation is cheaper than as order-9.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:50 ` Kiryl Shutsemau
@ 2026-02-19 15:53 ` David Hildenbrand (Arm)
2026-02-19 19:31 ` Pedro Falcato
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 15:53 UTC (permalink / raw)
To: Kiryl Shutsemau, Pedro Falcato
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 16:50, Kiryl Shutsemau wrote:
> On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
>> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
>>> No, there's no new hardware (that I know of). I want to explore what page size
>>> means.
>>>
>>> The kernel uses the same value - PAGE_SIZE - for two things:
>>>
>>> - the order-0 buddy allocation size;
>>>
>>> - the granularity of virtual address space mapping;
>>>
>>> I think we can benefit from separating these two meanings and allowing
>>> order-0 allocations to be larger than the virtual address space covered by a
>>> PTE entry.
>>>
>>
>> Doesn't this idea make less sense these days, with mTHP? Simply by toggling one
>> of the entries in /sys/kernel/mm/transparent_hugepage.
>
> mTHP is still best effort. This is way you don't need to care about
> fragmentation, you will get your 64k page as long as you have free
> memory.
>
>>> The main motivation is scalability. Managing memory on multi-terabyte
>>> machines in 4k is suboptimal, to say the least.
>>>
>>> Potential benefits of the approach (assuming 64k pages):
>>>
>>> - The order-0 page size cuts struct page overhead by a factor of 16. From
>>> ~1.6% of RAM to ~0.1%;
>>>
>>> - TLB wins on machines with TLB coalescing as long as mapping is naturally
>>> aligned;
>>>
>>> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>>>
>>> - 1G pages are within possibility for the buddy allocator - order-14
>>> allocation. It can open the road to 1G THPs.
>>>
>>> - As with THP, fewer pages - less pressure on the LRU lock;
>>
>> We could perhaps add a way to enforce a min_order globally on the page cache,
>> as a way to address it.
>
> Raising min_order is not free. I puts more pressure on page allocator.
>
>> There are some points there which aren't addressed by mTHP work in any way
>> (1G THPs for one), others which are being addressed separately (memdesc work
>> trying to cut down on struct page overhead).
>>
>> (I also don't understand your point about order-5 allocation, AFAIK pcp will
>> cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
>> full picture)
>
> With higher base page size, page allocator doesn't need to do as much
> work to merge/split buddy pages. So serving the same 2M as order-5 is
> cheaper than order-9.
I think the idea is that if most of your allocations (anon + pagecache)
are 64k instead of 4k, on average, you'll just naturally do less
merging/splitting.
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:39 ` David Hildenbrand (Arm)
@ 2026-02-19 15:54 ` Kiryl Shutsemau
2026-02-19 16:09 ` David Hildenbrand (Arm)
2026-02-19 17:09 ` Kiryl Shutsemau
2026-02-19 23:24 ` Kalesh Singh
2 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 15:54 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
> >
> > The main motivation is scalability. Managing memory on multi-terabyte
> > machines in 4k is suboptimal, to say the least.
> >
> > Potential benefits of the approach (assuming 64k pages):
> >
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> >
> > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > aligned;
> >
> > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> >
> > - 1G pages are within possibility for the buddy allocator - order-14
> > allocation. It can open the road to 1G THPs.
> >
> > - As with THP, fewer pages - less pressure on the LRU lock;
> >
> > - ...
> >
> > The trade-off is memory waste (similar to what we have on architectures with
> > native 64k pages today) and complexity, mostly in the core-MM code.
> >
> > == Design considerations ==
> >
> > I want to split PAGE_SIZE into two distinct values:
> >
> > - PTE_SIZE defines the virtual address space granularity;
> >
> > - PG_SIZE defines the size of the order-0 buddy allocation;
> >
> > PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
> > requires conversion, and keep existing code working while conversion is in
> > progress.
> >
> > The same split happens for other page-related macros: mask, shift,
> > alignment helpers, etc.
> >
> > PFNs are in PTE_SIZE units.
> >
> > The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
> > units.
> >
> > Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
> > for userspace. But we might want to communicate PG_SIZE to userspace to
> > get the optimal results for userspace that cares.
> >
> > PTE_SIZE granularity requires a substantial rework of page fault and VMA
> > handling:
> >
> > - A struct page pointer and pgprot_t are not enough to create a PTE entry.
> > We also need the offset within the page we are creating the PTE for.
> >
> > - Since the VMA start can be aligned arbitrarily with respect to the
> > underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
> > which is in PTE_SIZE units.
> >
> > - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
> > misaligned cases;
> >
> > Page faults into file mappings are relatively simple to handle as we
> > always have the page cache to refer to. So you can map only the part of the
> > page that fits in the page table, similarly to fault-around.
> >
> > Anonymous and file-CoW faults should also be simple as long as the VMA is
> > aligned to PG_SIZE in both the virtual address space and with respect to
> > vm_pgoff. We might waste some memory on the ends of the VMA, but it is
> > tolerable.
> >
> > Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
> > pages across a page table boundary. In the worst case, a page is mapped across
> > a PGD entry boundary and PTEs for the page have to be put in two separate
> > subtrees of page tables.
> >
> > A naive implementation would map different pages on different sides of a
> > page table boundary and accept the waste of one page per page table crossing.
> > The hope is that misaligned mappings are rare, but this is suboptimal.
> >
> > mremap(2) is the ultimate stress test for the design.
> >
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> When discussing per-process page sizes with Ryan and Dev, I mentioned that
> having a larger emulated page size could be interesting for other
> architectures as well.
>
> That is, we would emulate a 64K page size on Intel for user space as well,
> but let the OS work with 4K pages.
>
> We'd only allocate+map large folios into user space + pagecache, but still
> allow for page tables etc. to not waste memory.
>
> So "most" of your allocations in the system would actually be at least 64k,
> reducing zone lock contention etc.
I am not convinced emulation would help zone lock contention. I expect
contention to be higher if the page allocator sees a mix of 4k and 64k
requests. It sounds like constant split/merge under the lock.
> It doesn't solve all the problems you wanted to tackle on your list (e.g.,
> "struct page" overhead, which will be sorted out by memdescs).
I don't think we can serve 1G pages out of the buddy allocator with 4k
order-0. And without that, I don't see how to get to viable 1G THPs.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:54 ` Kiryl Shutsemau
@ 2026-02-19 16:09 ` David Hildenbrand (Arm)
2026-02-20 2:55 ` Zi Yan
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 16:09 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 16:54, Kiryl Shutsemau wrote:
> On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/19/26 16:08, Kiryl Shutsemau wrote:
>>> No, there's no new hardware (that I know of). I want to explore what page size
>>> means.
>>>
>>> The kernel uses the same value - PAGE_SIZE - for two things:
>>>
>>> - the order-0 buddy allocation size;
>>>
>>> - the granularity of virtual address space mapping;
>>>
>>> I think we can benefit from separating these two meanings and allowing
>>> order-0 allocations to be larger than the virtual address space covered by a
>>> PTE entry.
>>>
>>> The main motivation is scalability. Managing memory on multi-terabyte
>>> machines in 4k is suboptimal, to say the least.
>>>
>>> Potential benefits of the approach (assuming 64k pages):
>>>
>>> - The order-0 page size cuts struct page overhead by a factor of 16. From
>>> ~1.6% of RAM to ~0.1%;
>>>
>>> - TLB wins on machines with TLB coalescing as long as mapping is naturally
>>> aligned;
>>>
>>> - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
>>>
>>> - 1G pages are within possibility for the buddy allocator - order-14
>>> allocation. It can open the road to 1G THPs.
>>>
>>> - As with THP, fewer pages - less pressure on the LRU lock;
>>>
>>> - ...
>>>
>>> The trade-off is memory waste (similar to what we have on architectures with
>>> native 64k pages today) and complexity, mostly in the core-MM code.
>>>
>>> == Design considerations ==
>>>
>>> I want to split PAGE_SIZE into two distinct values:
>>>
>>> - PTE_SIZE defines the virtual address space granularity;
>>>
>>> - PG_SIZE defines the size of the order-0 buddy allocation;
>>>
>>> PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
>>> requires conversion, and keep existing code working while conversion is in
>>> progress.
>>>
>>> The same split happens for other page-related macros: mask, shift,
>>> alignment helpers, etc.
>>>
>>> PFNs are in PTE_SIZE units.
>>>
>>> The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
>>> units.
>>>
>>> Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
>>> for userspace. But we might want to communicate PG_SIZE to userspace to
>>> get the optimal results for userspace that cares.
>>>
>>> PTE_SIZE granularity requires a substantial rework of page fault and VMA
>>> handling:
>>>
>>> - A struct page pointer and pgprot_t are not enough to create a PTE entry.
>>> We also need the offset within the page we are creating the PTE for.
>>>
>>> - Since the VMA start can be aligned arbitrarily with respect to the
>>> underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
>>> which is in PTE_SIZE units.
>>>
>>> - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
>>> misaligned cases;
>>>
>>> Page faults into file mappings are relatively simple to handle as we
>>> always have the page cache to refer to. So you can map only the part of the
>>> page that fits in the page table, similarly to fault-around.
>>>
>>> Anonymous and file-CoW faults should also be simple as long as the VMA is
>>> aligned to PG_SIZE in both the virtual address space and with respect to
>>> vm_pgoff. We might waste some memory on the ends of the VMA, but it is
>>> tolerable.
>>>
>>> Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
>>> pages across a page table boundary. In the worst case, a page is mapped across
>>> a PGD entry boundary and PTEs for the page have to be put in two separate
>>> subtrees of page tables.
>>>
>>> A naive implementation would map different pages on different sides of a
>>> page table boundary and accept the waste of one page per page table crossing.
>>> The hope is that misaligned mappings are rare, but this is suboptimal.
>>>
>>> mremap(2) is the ultimate stress test for the design.
>>>
>>> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
>>> is greater than 4 KB, we need a way to pack multiple page tables into a
>>> single page. We could use the slab allocator for this, but it would
>>> require relocating the page-table metadata out of struct page.
>>
>> When discussing per-process page sizes with Ryan and Dev, I mentioned that
>> having a larger emulated page size could be interesting for other
>> architectures as well.
>>
>> That is, we would emulate a 64K page size on Intel for user space as well,
>> but let the OS work with 4K pages.
>>
>> We'd only allocate+map large folios into user space + pagecache, but still
>> allow for page tables etc. to not waste memory.
>>
>> So "most" of your allocations in the system would actually be at least 64k,
>> reducing zone lock contention etc.
>
> I am not convinced emulation would help zone lock contention. I expect
> contention to be higher if page allocator would see a mix of 4k and 64k
> requests. It sounds like constant split/merge under the lock.
If most of your allocations are larger, then there isn't that much
splitting/merging.
There will be some for the < 64k allocations of course, but when all
user space + page cache allocations are >= 64k, the split/merge work and
zone lock pressure should be heavily reduced.
>
>> It doesn't solve all the problems you wanted to tackle on your list (e.g.,
>> "struct page" overhead, which will be sorted out by memdescs).
>
> I don't think we can serve 1G pages out of buddy allocator with 4k
> order-0. And without it, I don't see how to get to a viable 1G THPs.
Zi Yan was the one working on this, and I think we had ideas on how to make
that work in the long run.
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (2 preceding siblings ...)
2026-02-19 15:39 ` David Hildenbrand (Arm)
@ 2026-02-19 17:08 ` Dave Hansen
2026-02-19 22:05 ` Kiryl Shutsemau
2026-02-19 17:30 ` Dave Hansen
` (2 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2026-02-19 17:08 UTC (permalink / raw)
To: Kiryl Shutsemau, lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, David Hildenbrand,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 07:08, Kiryl Shutsemau wrote:
> - The order-0 page size cuts struct page overhead by a factor of 16. From
> ~1.6% of RAM to ~0.1%;
First of all, this looks like fun. Nice work! I'm not opposed at all in
concept to cleaning up things and doing the logical separation you
described to split buddy granularity and mapping granularity. That seems
like a worthy endeavor and some of the union/#define tricks look like a
likely viable way to do it incrementally.
But I don't think there's going to be a lot of memory savings in the
end. Maybe this would bring the mem= hyperscalers back into the fold and
have them actually start using 'struct page' again for their VM memory.
Dunno.
But, let's look at my kernel directory and round the file sizes up to
4k, 16k and 64k:
find . -printf '%s\n' | while read size; do echo \
$(((size + 0x0fff) & 0xfffff000)) \
$(((size + 0x3fff) & 0xffffc000)) \
$(((size + 0xffff) & 0xffff0000));
done
... and add them all up:
11,297,648 KB - on disk
11,297,712 KB - in a 4k page cache
12,223,488 KB - in a 16k page cache
16,623,296 KB - in a 64k page cache
So a 64k page cache eats ~5GB of extra memory for a kernel tree (well,
_my_ kernel tree). In other words, if you are looking for memory savings
on my laptop, you'll need ~300GB of RAM before 'struct page' overhead
overwhelms the page cache bloat from a single kernel tree.
The whole kernel obviously isn't in the page cache all at the same time.
The page cache across the system is also obviously different than a
kernel tree, but you get the point.
That's not to diminish how useful something like this might be,
especially for folks that are sensitive to 'struct page' overhead or
allocator performance.
But, it will mostly be getting better performance at the _cost_ of
consuming more RAM, not saving RAM.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
@ 2026-02-19 17:09 ` Kiryl Shutsemau
2026-02-20 10:24 ` David Hildenbrand (Arm)
2026-02-19 23:24 ` Kalesh Singh
2 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 17:09 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
> >
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
> >
> > The main motivation is scalability. Managing memory on multi-terabyte
> > machines in 4k is suboptimal, to say the least.
> >
> > Potential benefits of the approach (assuming 64k pages):
> >
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> >
> > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > aligned;
> >
> > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> >
> > - 1G pages are within possibility for the buddy allocator - order-14
> > allocation. It can open the road to 1G THPs.
> >
> > - As with THP, fewer pages - less pressure on the LRU lock;
> >
> > - ...
> >
> > The trade-off is memory waste (similar to what we have on architectures with
> > native 64k pages today) and complexity, mostly in the core-MM code.
> >
> > == Design considerations ==
> >
> > I want to split PAGE_SIZE into two distinct values:
> >
> > - PTE_SIZE defines the virtual address space granularity;
> >
> > - PG_SIZE defines the size of the order-0 buddy allocation;
> >
> > PAGE_SIZE is only defined if PTE_SIZE == PG_SIZE. It will flag which code
> > requires conversion, and keep existing code working while conversion is in
> > progress.
> >
> > The same split happens for other page-related macros: mask, shift,
> > alignment helpers, etc.
> >
> > PFNs are in PTE_SIZE units.
> >
> > The buddy allocator and page cache (as well as all I/O) operate in PG_SIZE
> > units.
> >
> > Userspace mappings are maintained with PTE_SIZE granularity. No ABI changes
> > for userspace. But we might want to communicate PG_SIZE to userspace to
> > get the optimal results for userspace that cares.
> >
> > PTE_SIZE granularity requires a substantial rework of page fault and VMA
> > handling:
> >
> > - A struct page pointer and pgprot_t are not enough to create a PTE entry.
> > We also need the offset within the page we are creating the PTE for.
> >
> > - Since the VMA start can be aligned arbitrarily with respect to the
> > underlying page, vma->vm_pgoff has to be changed to vma->vm_pteoff,
> > which is in PTE_SIZE units.
> >
> > - The page fault handler needs to handle PTE_SIZE < PG_SIZE, including
> > misaligned cases;
> >
> > Page faults into file mappings are relatively simple to handle as we
> > always have the page cache to refer to. So you can map only the part of the
> > page that fits in the page table, similarly to fault-around.
> >
> > Anonymous and file-CoW faults should also be simple as long as the VMA is
> > aligned to PG_SIZE in both the virtual address space and with respect to
> > vm_pgoff. We might waste some memory on the ends of the VMA, but it is
> > tolerable.
> >
> > Misaligned anonymous and file-CoW faults are a pain. Specifically, mapping
> > pages across a page table boundary. In the worst case, a page is mapped across
> > a PGD entry boundary and PTEs for the page have to be put in two separate
> > subtrees of page tables.
> >
> > A naive implementation would map different pages on different sides of a
> > page table boundary and accept the waste of one page per page table crossing.
> > The hope is that misaligned mappings are rare, but this is suboptimal.
> >
> > mremap(2) is the ultimate stress test for the design.
> >
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> When discussing per-process page sizes with Ryan and Dev, I mentioned that
> having a larger emulated page size could be interesting for other
> architectures as well.
>
> That is, we would emulate a 64K page size on Intel for user space as well,
> but let the OS work with 4K pages.
Just to clarify: do you want this to be enforced in the userspace ABI?
Like, all mappings being 64k aligned?
> We'd only allocate+map large folios into user space + pagecache, but still
> allow for page tables etc. to not waste memory.
Wasting memory on page tables is solvable and pretty straightforward.
Most such cases can be solved mechanically by switching to slab.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (3 preceding siblings ...)
2026-02-19 17:08 ` Dave Hansen
@ 2026-02-19 17:30 ` Dave Hansen
2026-02-19 22:14 ` Kiryl Shutsemau
2026-02-19 17:47 ` Matthew Wilcox
2026-02-20 9:04 ` David Laight
6 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2026-02-19 17:30 UTC (permalink / raw)
To: Kiryl Shutsemau, lsf-pc, linux-mm
Cc: x86, linux-kernel, Andrew Morton, David Hildenbrand,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/19/26 07:08, Kiryl Shutsemau wrote:
...
> The patchset is large:
>
> 378 files changed, 3348 insertions(+), 3102 deletions(-)
A few notes about the diffstats:
$ git diff v6.17..HEAD arch/x86 | diffstat | tail -1
105 files changed, 874 insertions(+), 843 deletions(-)
$ git diff v6.17..HEAD mm | diffstat | tail -1
53 files changed, 1136 insertions(+), 1069 deletions(-)
The vast, vast majority of this seems to be the renames. Stuff like:
> - new = round_down(new, PAGE_SIZE);
> + new = round_down(new, PTE_SIZE);
or even less worrying:
> -int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
> +int set_direct_map_valid_noflush(struct page *page, unsigned numpages, bool valid);
That stuff obviously needs to be audited but it's far less concerning
than the logic changes.
So just for review sanity, if you go forward with this, I'd very much
appreciate a strong separation of the purely mechanical bits from any
logic changes.
> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> is greater than 4 KB, we need a way to pack multiple page tables into a
> single page. We could use the slab allocator for this, but it would
> require relocating the page-table metadata out of struct page.
Others mentioned this, but I think this essentially gates what you are
doing behind a full tree conversion over to ptdescs.
The most useful thing we can do with this series is look at it and
decide what _other_ things need to get done before the tree could
possibly go in that direction, like ptdesc or the disambiguation
between PTE_SIZE and PG_SIZE that you've kicked off here.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (4 preceding siblings ...)
2026-02-19 17:30 ` Dave Hansen
@ 2026-02-19 17:47 ` Matthew Wilcox
2026-02-19 22:26 ` Kiryl Shutsemau
2026-02-20 9:04 ` David Laight
6 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2026-02-19 17:47 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> is greater than 4 KB, we need a way to pack multiple page tables into a
> single page. We could use the slab allocator for this, but it would
> require relocating the page-table metadata out of struct page.
Have you looked at the s390/ppc implementations (yes, they're different,
no, that sucks)? slab seems like the wrong approach to me.
There's a third approach that I've never looked at which is to allocate
the larger size, then just use it for N consecutive entries.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:53 ` David Hildenbrand (Arm)
@ 2026-02-19 19:31 ` Pedro Falcato
0 siblings, 0 replies; 33+ messages in thread
From: Pedro Falcato @ 2026-02-19 19:31 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 04:53:10PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:50, Kiryl Shutsemau wrote:
> > On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
> > > On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > > > No, there's no new hardware (that I know of). I want to explore what page size
> > > > means.
> > > >
> > > > The kernel uses the same value - PAGE_SIZE - for two things:
> > > >
> > > > - the order-0 buddy allocation size;
> > > >
> > > > - the granularity of virtual address space mapping;
> > > >
> > > > I think we can benefit from separating these two meanings and allowing
> > > > order-0 allocations to be larger than the virtual address space covered by a
> > > > PTE entry.
> > > >
> > >
> > > Doesn't this idea make less sense these days, with mTHP? Simply by toggling one
> > > of the entries in /sys/kernel/mm/transparent_hugepage.
> >
> > mTHP is still best effort. This is way you don't need to care about
> > fragmentation, you will get your 64k page as long as you have free
> > memory.
> >
> > > > The main motivation is scalability. Managing memory on multi-terabyte
> > > > machines in 4k is suboptimal, to say the least.
> > > >
> > > > Potential benefits of the approach (assuming 64k pages):
> > > >
> > > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > > ~1.6% of RAM to ~0.1%;
> > > >
> > > > - TLB wins on machines with TLB coalescing as long as mapping is naturally
> > > > aligned;
> > > >
> > > > - Order-5 allocation is 2M, resulting in less pressure on the zone lock;
> > > >
> > > > - 1G pages are within possibility for the buddy allocator - order-14
> > > > allocation. It can open the road to 1G THPs.
> > > >
> > > > - As with THP, fewer pages - less pressure on the LRU lock;
> > >
> > > We could perhaps add a way to enforce a min_order globally on the page cache,
> > > as a way to address it.
> >
> > Raising min_order is not free. I puts more pressure on page allocator.
> >
> > > There are some points there which aren't addressed by mTHP work in any way
> > > (1G THPs for one), others which are being addressed separately (memdesc work
> > > trying to cut down on struct page overhead).
> > >
> > > (I also don't understand your point about order-5 allocation, AFAIK pcp will
> > > cache up to COSTLY_ORDER (3) and PMD order, but I'm probably not seeing the
> > > full picture)
> >
> > With higher base page size, page allocator doesn't need to do as much
> > work to merge/split buddy pages. So serving the same 2M as order-5 is
> > cheaper than order-9.
>
> I think the idea is that if most of your allocations (anon + pagecache) are
> 64k instead of 4k, on average, you'll just naturally do less merging
> splitting.
Yep. That plus slab_min_order would hopefully yield a system where 90%+
of allocations (depending on how your filesystem's buffer cache works) are 64K.
--
Pedro
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:08 ` Dave Hansen
@ 2026-02-19 22:05 ` Kiryl Shutsemau
2026-02-20 3:28 ` Liam R. Howlett
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 22:05 UTC (permalink / raw)
To: Dave Hansen
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > ~1.6% of RAM to ~0.1%;
> ...
> But, it will mostly be getting better performance at the _cost_ of
> consuming more RAM, not saving RAM.
That's fair.
The problem with struct page memory consumption is that it is static and
cannot be reclaimed. You pay the struct page tax no matter what.
Page cache rounding overhead can be large, but a motivated userspace can
keep it under control by avoiding splitting a dataset into many small
files. And this memory is reclaimable.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:30 ` Dave Hansen
@ 2026-02-19 22:14 ` Kiryl Shutsemau
2026-02-19 22:21 ` Dave Hansen
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 22:14 UTC (permalink / raw)
To: Dave Hansen
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 09:30:36AM -0800, Dave Hansen wrote:
> On 2/19/26 07:08, Kiryl Shutsemau wrote:
> ...
> > The patchset is large:
> >
> > 378 files changed, 3348 insertions(+), 3102 deletions(-)
>
> A few notes about the diffstats:
>
> $ git diff v6.17..HEAD arch/x86 | diffstat | tail -1
> 105 files changed, 874 insertions(+), 843 deletions(-)
> $ git diff v6.17..HEAD mm | diffstat | tail -1
> 53 files changed, 1136 insertions(+), 1069 deletions(-)
>
> The vast, vast majority of this seems to be the renames. Stuff like:
>
> > - new = round_down(new, PAGE_SIZE);
> > + new = round_down(new, PTE_SIZE);
>
> or even less worrying:
>
> > -int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
> > +int set_direct_map_valid_noflush(struct page *page, unsigned numpages, bool valid);
>
> That stuff obviously needs to be audited but it's far less concerning
> than the logic changes.
>
> So just for review sanity, if you go forward with this, I'd very much
> appreciate a strong separation of the purely mechanical bits from any
> logic changes.
That's the plan. That's the only way I can keep myself sane :P
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> Others mentioned this, but I think this essentially gates what you are
> doing behind a full tree conversion over to ptdescs.
I have not followed ptdescs closely. Need to catch up.
For PoC, I will just waste full order-0 page for page table. Packing is
not required for correctness.
> The most useful thing we can do with this series is look at it and
> decide what _other_ things need to get done before the tree could
> possibly go in that direction, like ptdesc or a the disambiguation
> between PTE_SIZE and PG_SIZE that you've kicked off here.
Right.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 22:14 ` Kiryl Shutsemau
@ 2026-02-19 22:21 ` Dave Hansen
0 siblings, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2026-02-19 22:21 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On 2/19/26 14:14, Kiryl Shutsemau wrote:
>> Others mentioned this, but I think this essentially gates what you are
>> doing behind a full tree conversion over to ptdescs.
> I have not followed ptdescs closely. Need to catch up.
>
> For PoC, I will just waste full order-0 page for page table. Packing is
> not required for correctness.
Yeah, I guess padding it out is ugly but effective.
I was trying to figure out how it would apply to the KPTI pgd because we
just flip bit 12 to switch between user and kernel PGDs. But I guess the
8k of PGDs in the current allocation will fit fine in 128k, so it's
weird but functional.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:47 ` Matthew Wilcox
@ 2026-02-19 22:26 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-19 22:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 05:47:22PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > On x86, page tables are allocated from the buddy allocator and if PG_SIZE
> > is greater than 4 KB, we need a way to pack multiple page tables into a
> > single page. We could use the slab allocator for this, but it would
> > require relocating the page-table metadata out of struct page.
>
> Have you looked at the s390/ppc implementations (yes, they're different,
> no, that sucks)?
No, will check it out tomorrow.
> slab seems like the wrong approach to me.
It was the first thing that came to mind. I have not put much time into
it.
> There's a third approach that I've never looked at which is to allocate
> the larger size, then just use it for N consecutive entries.
Yeah, that's a possible way. We would need to populate 16 page table
entries of the parent page table. But you don't need to care about
fragmentation within the page.
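Something like the following, assuming 64k PG_SIZE, 4k page tables, and a
PMD range that is 16-entry aligned. The names are illustrative, not from
the branch:

	/* One PG_SIZE allocation backs PG_SIZE / PTE_SIZE == 16 consecutive
	 * 4k page tables; wire up 16 consecutive PMD entries at once. */
	static void pmd_populate_batch(struct mm_struct *mm, pmd_t *pmd,
				       struct page *tables)
	{
		unsigned long pfn = page_to_pfn(tables);	/* PTE_SIZE units */
		int i;

		for (i = 0; i < PG_SIZE / PTE_SIZE; i++)
			set_pmd(pmd + i, __pmd(((pfn + i) << PTE_SHIFT) |
					       _PAGE_TABLE));
	}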
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
2026-02-19 17:09 ` Kiryl Shutsemau
@ 2026-02-19 23:24 ` Kalesh Singh
2026-02-20 12:10 ` Kiryl Shutsemau
2 siblings, 1 reply; 33+ messages in thread
From: Kalesh Singh @ 2026-02-19 23:24 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif, android-mm,
Adrian Barnaś,
Mateusz Maćkowski, Steven Moreland
On Thu, Feb 19, 2026 at 7:39 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > [... full proposal snipped ...]
>
> When discussing per-process page sizes with Ryan and Dev, I mentioned
> that having a larger emulated page size could be interesting for other
> architectures as well.
>
> That is, we would emulate a 64K page size on Intel for user space as
> well, but let the OS work with 4K pages.
>
> We'd only allocate+map large folios into user space + pagecache, but
> still allow for page tables etc. to not waste memory.
>
> So "most" of your allocations in the system would actually be at least
> 64k, reducing zone lock contention etc.
>
>
> It doesn't solve all the problems you wanted to tackle on your list
> (e.g., "struct page" overhead, which will be sorted out by memdescs).
Hi Kiryl,
I'd be interested to discuss this at LSFMM.
On Android, we have a separate but related use case: we emulate the
userspace page size on x86, primarily to enable app developers to
conduct compatibility testing of their apps for 16KB Android devices.
[1]
It mainly works by enforcing a larger granularity on the VMAs to
emulate a userspace page size, somewhat similar to what David
mentioned, while the underlying kernel still operates on a 4KB
granularity. [2]
IIUC the current design would not enforce the larger granularity /
alignment for VMAs, to avoid breaking ABI. However, I'd be interested to
discuss whether it can be extended to cover this use case as well.
[1] https://developer.android.com/guide/practices/page-sizes#16kb-emulator
[2] https://source.android.com/docs/core/architecture/16kb-page-size/getting-started-cf-x86-64-pgagnostic
Thanks,
Kalesh
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 16:09 ` David Hildenbrand (Arm)
@ 2026-02-20 2:55 ` Zi Yan
0 siblings, 0 replies; 33+ messages in thread
From: Zi Yan @ 2026-02-20 2:55 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On 19 Feb 2026, at 11:09, David Hildenbrand (Arm) wrote:
> On 2/19/26 16:54, Kiryl Shutsemau wrote:
>> On Thu, Feb 19, 2026 at 04:39:34PM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/19/26 16:08, Kiryl Shutsemau wrote:
>>>> [... full proposal snipped ...]
>>>
>>> When discussing per-process page sizes with Ryan and Dev, I mentioned that
>>> having a larger emulated page size could be interesting for other
>>> architectures as well.
>>>
>>> That is, we would emulate a 64K page size on Intel for user space as well,
>>> but let the OS work with 4K pages.
>>>
>>> We'd only allocate+map large folios into user space + pagecache, but still
>>> allow for page tables etc. to not waste memory.
>>>
>>> So "most" of your allocations in the system would actually be at least 64k,
>>> reducing zone lock contention etc.
>>
>> I am not convinced emulation would help zone lock contention. I expect
>> contention to be higher if the page allocator sees a mix of 4k and 64k
>> requests. It sounds like constant split/merge under the lock.
>
> If most of your allocations are larger, then there isn't that much splitting/merging.
>
> There will be some for the < 64k allocations of course, but when all user space+page cache is >= 64k, then the split/merge + zone lock contention should be heavily reduced.
>
>>
>>> It doesn't solve all the problems you wanted to tackle on your list (e.g.,
>>> "struct page" overhead, which will be sorted out by memdescs).
>>
>> I don't think we can serve 1G pages out of the buddy allocator with 4k
>> order-0. And without it, I don't see how to get to viable 1G THPs.
>
> Zi Yan was the one working on this, and I think we had ideas on how to make that work in the long run.
Right. The idea is to add a super pageblock (or whatever name we pick), which consists of N
consecutive pageblocks, so that anti-fragmentation can work at a larger granularity, e.g., 1GB,
to create free pages. Whether 1GB free pages from memory compaction need to go into the buddy
allocator or not is debatable.
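For scale (a sketch; SUPER_PAGEBLOCK_ORDER is a made-up name):

	/* With 4k pages, a pageblock is order-9 (2M). Grouping 512 of
	 * them gives anti-fragmentation a 1G working unit: 2M * 512 = 1G. */
	#define SUPER_PAGEBLOCK_ORDER	(pageblock_order + 9)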
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 22:05 ` Kiryl Shutsemau
@ 2026-02-20 3:28 ` Liam R. Howlett
2026-02-20 12:33 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2026-02-20 3:28 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Dave Hansen, lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
* Kiryl Shutsemau <kas@kernel.org> [260219 17:05]:
> On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> > On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > ~1.6% of RAM to ~0.1%;
> > ...
> > But, it will mostly be getting better performance at the _cost_ of
> > consuming more RAM, not saving RAM.
>
> That's fair.
>
> The problem with struct page memory consumption is that it is static and
> cannot be reclaimed. You pay the struct page tax no matter what.
>
> Page cache rounding overhead can be large, but a motivated userspace can
> keep it under control by avoiding splitting a dataset into many small
> files. And this memory is reclaimable.
>
But we are in reclaim a lot more these days. As I'm sure you are aware,
we are trying to maximize the resources (both cpu and ram) of any
machine powered on. Entering reclaim will consume cpu time and affect
other tasks.

Especially on machines running multiple workloads, the tendency is to
have a primary workload in focus, with the less important work being
killed if necessary. Reducing the overhead just means more secondary
tasks, or a bigger footprint for the ones already running.

Increasing the memory pressure will degrade the primary workload more
frequently, even if we recover enough to avoid OOMing the secondary.

Whereas in the struct-page-tax world, the secondary task would be killed
after a shorter (and less frequently executed) reclaim comes up short.
So I would think that we would be degrading the primary workload in an
attempt to keep the secondary alive? Maybe I'm over-simplifying here?

Near the other end of the spectrum, we have chromebooks that are
constantly in reclaim, even with 4k pages. I guess these machines would
be destined to keep the same page size they use today. That is, this
solution to the struct page tax is only useful if you have a lot of
memory. But then again, that's where the bookkeeping costs become hard
to take.
Thanks,
Liam
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
` (5 preceding siblings ...)
2026-02-19 17:47 ` Matthew Wilcox
@ 2026-02-20 9:04 ` David Laight
2026-02-20 12:12 ` Kiryl Shutsemau
6 siblings, 1 reply; 33+ messages in thread
From: David Laight @ 2026-02-20 9:04 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, 19 Feb 2026 15:08:51 +0000
Kiryl Shutsemau <kas@kernel.org> wrote:
> No, there's no new hardware (that I know of). I want to explore what page size
> means.
>
> The kernel uses the same value - PAGE_SIZE - for two things:
>
> - the order-0 buddy allocation size;
>
> - the granularity of virtual address space mapping;
Also the 'random' buffers that are PAGE_SIZE rather than 4k.
I also wonder how it affects mmap of kernel memory and the alignment
of PCIe windows (etc).
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 17:09 ` Kiryl Shutsemau
@ 2026-02-20 10:24 ` David Hildenbrand (Arm)
2026-02-20 12:07 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-20 10:24 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
>> When discussing per-process page sizes with Ryan and Dev, I mentioned that
>> having a larger emulated page size could be interesting for other
>> architectures as well.
>>
>> That is, we would emulate a 64K page size on Intel for user space as well,
>> but let the OS work with 4K pages.
>
> Just to clarify, do you want it to be enforced on the userspace ABI?
> Like, all mappings are 64k aligned?
Right, see the proposal from Dev on the list.
From the user-space POV, the page size would be 64K for these emulated
processes. That is, VMAs must be suitably aligned, etc.
One key thing, I think, is that you could run such emulated-64k processes
(that actually support it!) alongside 4k processes on the same machine,
like Arm is considering.
You would have no weird "vma crosses base pages" handling, which is just
rather nasty and makes my head hurt.
>
>> We'd only allocate+map large folios into user space + pagecache, but still
>> allow for page tables etc. to not waste memory.
>
> Waste of memory for page tables is solvable and pretty straightforward.
> Most of such cases can be solved mechanically by switching to slab.
Well, yes, like Willy says, there are already similar custom solutions
for s390x and ppc.
Pasha talked recently about the memory waste of 16k kernel stacks and
how we would want to reduce that to 4k. In your proposal, it would be
64k, unless you somehow manage to allocate multiple kernel stacks from
the same 64k page. My head hurts thinking about whether that could work,
maybe it could (no idea about guard pages in there, though).
Let's take a look at the history of page size usage on Arm (people can
feel free to correct me):
(1) Most distros were using 64k on Arm.
(2) People realized that 64k was suboptimal for many use cases (memory
waste for stacks, pagecache, etc) and started to switch to 4k. I
remember that mostly HPC-centric users stuck to 64k, but there was
also demand from others to be able to stay on 64k.
(3) Arm improved performance on a 4k kernel by adding cont-pte support,
trying to get closer to 64k native performance.
(4) Achieving 64k native performance is hard, which is why per-process
page sizes are being explored to get the best out of both worlds
(use 64k page size only where it really matters for performance).
Arm clearly has the advantage of actually benefiting from hardware
support for 64k.
IIUC, what you are proposing feels a bit like traveling back in time
when it comes to the memory waste problem that Arm users encountered.
Where do you see the big difference to 64k on Arm in your proposal?
Would you currently also be running 64k Arm in production, and is the
memory waste etc acceptable?
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 10:24 ` David Hildenbrand (Arm)
@ 2026-02-20 12:07 ` Kiryl Shutsemau
2026-02-20 16:30 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:07 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
> > > When discussing per-process page sizes with Ryan and Dev, I mentioned that
> > > having a larger emulated page size could be interesting for other
> > > architectures as well.
> > >
> > > That is, we would emulate a 64K page size on Intel for user space as well,
> > > but let the OS work with 4K pages.
> >
> > Just to clarify, do you want it to be enforced on the userspace ABI?
> > Like, all mappings are 64k aligned?
>
> Right, see the proposal from Dev on the list.
>
> From the user-space POV, the page size would be 64K for these emulated processes.
> That is, VMAs must be suitably aligned, etc.
Well, it will drastically limit the adoption. We have too much legacy
stuff on x86.
> > > We'd only allocate+map large folios into user space + pagecache, but still
> > > allow for page tables etc. to not waste memory.
> >
> > Waste of memory for page tables is solvable and pretty straightforward.
> > Most of such cases can be solved mechanically by switching to slab.
>
> Well, yes, like Willy says, there are already similar custom solutions for
> s390x and ppc.
>
> Pasha talked recently about the memory waste of 16k kernel stacks and how we
> would want to reduce that to 4k. In your proposal, it would be 64k, unless
> you somehow manage to allocate multiple kernel stacks from the same 64k
> page. My head hurts thinking about whether that could work, maybe it could
> (no idea about guard pages in there, though).
Kernel stacks are allocated from vmalloc. I think mapping them with
sub-page granularity should be doable.
BTW, do you see any reason why a slab-allocated stack wouldn't work for
large base page sizes? There's no requirement for it to be aligned to a
page or PTE, right?
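Something like this is what I have in mind (just a sketch; the obvious
cost vs. vmalloc-backed stacks is losing the guard pages around each
stack):

	/* A slab cache of THREAD_SIZE-d, THREAD_SIZE-aligned stacks;
	 * no page or PTE alignment assumed anywhere. */
	struct kmem_cache *cache = kmem_cache_create("kernel_stack",
						     THREAD_SIZE, THREAD_SIZE,
						     0, NULL);
	void *stack = kmem_cache_alloc(cache, GFP_KERNEL);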
> [...]
>
> IIUC, what you are proposing feels a bit like traveling back in time when it
> comes to the memory waste problem that Arm users encountered.
>
> Where do you see the big difference to 64k on Arm in your proposal? Would
> you currently also be running 64k Arm in production, and is the memory
> waste etc acceptable?
That's the point. I don't see a big difference to 64k Arm. I want to
bring this option to x86: at some machine size, it makes sense to trade
memory consumption for scalability. I am targeting it at machines with
over 2TiB of RAM.

BTW, we do run 64k Arm in our fleet. There are some growing pains, but it
looks good in general. We have no plans to switch to 4k (or 16k) at the
moment. 512M THPs also look good on some workloads.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-19 23:24 ` Kalesh Singh
@ 2026-02-20 12:10 ` Kiryl Shutsemau
2026-02-20 19:21 ` Kalesh Singh
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:10 UTC (permalink / raw)
To: Kalesh Singh
Cc: David Hildenbrand (Arm),
lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif, android-mm, Adrian Barnaś,
Mateusz Maćkowski, Steven Moreland
On Thu, Feb 19, 2026 at 03:24:37PM -0800, Kalesh Singh wrote:
> On Thu, Feb 19, 2026 at 7:39 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > > [... full proposal snipped ...]
> >
> > When discussing per-process page sizes with Ryan and Dev, I mentioned
> > that having a larger emulated page size could be interesting for other
> > architectures as well.
> >
> > That is, we would emulate a 64K page size on Intel for user space as
> > well, but let the OS work with 4K pages.
> >
> > We'd only allocate+map large folios into user space + pagecache, but
> > still allow for page tables etc. to not waste memory.
> >
> > So "most" of your allocations in the system would actually be at least
> > 64k, reducing zone lock contention etc.
> >
> >
> > It doesn't solve all the problems you wanted to tackle on your list
> > (e.g., "struct page" overhead, which will be sorted out by memdescs).
>
> Hi Kiryl,
>
> I'd be interested to discuss this at LSFMM.
>
> On Android, we have a separate but related use case: we emulate the
> userspace page size on x86, primarily to enable app developers to
> conduct compatibility testing of their apps for 16KB Android devices.
> [1]
>
> It mainly works by enforcing a larger granularity on the VMAs to
> emulate a userspace page size, somewhat similar to what David
> mentioned, while the underlying kernel still operates on a 4KB
> granularity. [2]
>
> IIUC the current design would not enfore the larger granularity /
> alignment for VMAs to avoid breaking ABI. However, I'd be interest to
> discuss whether it can be extended to cover this usecase as well.
I don't want to break ABI, but might add a knob (maybe personality(2) ?)
for enforcement to see what breaks.
In general, I would prefer to advertise a new value to userspace that
would mean preferred virtual address space granularity.
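Something along these lines, say (AT_PREF_PAGESZ is a made-up name for a
hypothetical new auxv entry):

	#include <sys/auxv.h>
	#include <unistd.h>

	long pte_size = sysconf(_SC_PAGESIZE);	  /* mapping granularity */
	long pg_size = getauxval(AT_PREF_PAGESZ); /* preferred granularity */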
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 9:04 ` David Laight
@ 2026-02-20 12:12 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:12 UTC (permalink / raw)
To: David Laight
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 09:04:09AM +0000, David Laight wrote:
> On Thu, 19 Feb 2026 15:08:51 +0000
> Kiryl Shutsemau <kas@kernel.org> wrote:
>
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> >
> > The kernel uses the same value - PAGE_SIZE - for two things:
> >
> > - the order-0 buddy allocation size;
> >
> > - the granularity of virtual address space mapping;
>
> Also the 'random' buffers that are PAGE_SIZE rather than 4k.
Yeah, in some places we use PAGE_SIZE just because, without any real reason.
> I also wonder how is affects mmap of kernel memory and the alignement
> of PCIe windows (etc).
The kernel, like userspace, is free to map memory at PTE_SIZE granularity.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 3:28 ` Liam R. Howlett
@ 2026-02-20 12:33 ` Kiryl Shutsemau
2026-02-20 15:17 ` Liam R. Howlett
0 siblings, 1 reply; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 12:33 UTC (permalink / raw)
To: Liam R. Howlett, Dave Hansen, lsf-pc, linux-mm, x86,
linux-kernel, Andrew Morton, David Hildenbrand, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Lorenzo Stoakes,
Mike Rapoport, Matthew Wilcox, Johannes Weiner, Usama Arif
On Thu, Feb 19, 2026 at 10:28:20PM -0500, Liam R. Howlett wrote:
> * Kiryl Shutsemau <kas@kernel.org> [260219 17:05]:
> > On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> > > On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > > ~1.6% of RAM to ~0.1%;
> > > ...
> > > But, it will mostly be getting better performance at the _cost_ of
> > > consuming more RAM, not saving RAM.
> >
> > That's fair.
> >
> > The problem with struct page memory consumption is that it is static and
> > cannot be reclaimed. You pay the struct page tax no matter what.
> >
> > Page cache rounding overhead can be large, but a motivated userspace can
> > keep it under control by avoiding splitting a dataset into many small
> > files. And this memory is reclaimable.
> >
>
> But we are in reclaim a lot more these days. As I'm sure you are aware,
> we are trying to maximize the resources (both cpu and ram) of any
> machine powered on. Entering reclaim will consume cpu time and affect
> other tasks.
>
> Especially on machines running multiple workloads, the tendency is to
> have a primary workload in focus, with the less important work being
> killed if necessary. Reducing the overhead just means more secondary
> tasks, or a bigger footprint for the ones already running.
>
> Increasing the memory pressure will degrade the primary workload more
> frequently, even if we recover enough to avoid OOMing the secondary.
>
> Whereas in the struct-page-tax world, the secondary task would be killed
> after a shorter (and less frequently executed) reclaim comes up short.
> So I would think that we would be degrading the primary workload in an
> attempt to keep the secondary alive? Maybe I'm over-simplifying here?
I am not sure I fully follow your point.
Sizing tasks and scheduling tasks between machines is hard in general.
I don't think the balance between struct page tax and page cache
rounding overhead is going to be the primary factor.
> Near the other end of the spectrum, we have chromebooks that are
> constantly in reclaim, even with 4k pages. I guess these machines would
> be destined to keep the same page size they use today. That is, this
> solution to the struct page tax is only useful if you have a lot of
> memory. But then again, that's where the bookkeeping costs become hard
> to take.
Smaller machines are not the target for 64k pages. They will not benefit
from them.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 12:33 ` Kiryl Shutsemau
@ 2026-02-20 15:17 ` Liam R. Howlett
2026-02-20 15:50 ` Kiryl Shutsemau
0 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2026-02-20 15:17 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Dave Hansen, lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
* Kiryl Shutsemau <kas@kernel.org> [260220 07:33]:
> On Thu, Feb 19, 2026 at 10:28:20PM -0500, Liam R. Howlett wrote:
> > * Kiryl Shutsemau <kas@kernel.org> [260219 17:05]:
> > > On Thu, Feb 19, 2026 at 09:08:57AM -0800, Dave Hansen wrote:
> > > > On 2/19/26 07:08, Kiryl Shutsemau wrote:
> > > > > - The order-0 page size cuts struct page overhead by a factor of 16. From
> > > > > ~1.6% of RAM to ~0.1%;
> > > > ...
> > > > But, it will mostly be getting better performance at the _cost_ of
> > > > consuming more RAM, not saving RAM.
> > >
> > > That's fair.
> > >
> > > The problem with struct page memory consumption is that it is static and
> > > cannot be reclaimed. You pay the struct page tax no matter what.
> > >
> > > Page cache rounding overhead can be large, but a motivated userspace can
> > > keep it under control by avoiding splitting a dataset into many small
> > > files. And this memory is reclaimable.
> > >
> >
> > But we are in reclaim a lot more these days. As I'm sure you are aware,
> > we are trying to maximize the resources (both cpu and ram) of any
> > machine powered on. Entering reclaim will consume cpu time and affect
> > other tasks.
> >
> > Especially on machines running multiple workloads, the tendency is to
> > have a primary workload in focus, with the less important work being
> > killed if necessary. Reducing the overhead just means more secondary
> > tasks, or a bigger footprint for the ones already running.
> >
> > Increasing the memory pressure will degrade the primary workload more
> > frequently, even if we recover enough to avoid OOMing the secondary.
> >
> > Whereas in the struct-page-tax world, the secondary task would be killed
> > after a shorter (and less frequently executed) reclaim comes up short.
> > So I would think that we would be degrading the primary workload in an
> > attempt to keep the secondary alive? Maybe I'm over-simplifying here?
>
> I am not sure I fully follow your point.
>
> Sizing tasks and scheduling tasks between machines is hard in general.
> I don't think the balance between struct page tax and page cache
> rounding overhead is going to be the primary factor.
I think there are more trade-offs than what you listed. It's still
probably worth doing, but I wanted to know if you thought that this would
cause us to spend more time in reclaim, which seems to be implied above.
So, another trade-off might be all the reclaim penalty being paid more
frequently?
...
Thanks,
Liam
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 15:17 ` Liam R. Howlett
@ 2026-02-20 15:50 ` Kiryl Shutsemau
0 siblings, 0 replies; 33+ messages in thread
From: Kiryl Shutsemau @ 2026-02-20 15:50 UTC (permalink / raw)
To: Liam R. Howlett, Dave Hansen, lsf-pc, linux-mm, x86,
linux-kernel, Andrew Morton, David Hildenbrand, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Lorenzo Stoakes,
Mike Rapoport, Matthew Wilcox, Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 10:17:45AM -0500, Liam R. Howlett wrote:
> * Kiryl Shutsemau <kas@kernel.org> [260220 07:33]:
> > On Thu, Feb 19, 2026 at 10:28:20PM -0500, Liam R. Howlett wrote:
> > > [...]
> >
> > I am not sure I fully follow your point.
> >
> > Sizing tasks and scheduling tasks between machines is hard in general.
> > I don't think the balance between struct page tax and page cache
> > rounding overhead is going to be the primary factor.
>
> I think there are more trade-offs than what you listed. It's still
> probably worth doing, but I wanted to know if you thought that this would
> cause us to spend more time in reclaim, which seems to be implied above.
> So, another trade-off might be all the reclaim penalty being paid more
> frequently?
I am not sure.

The kernel would need to do less work in reclaim per unit of memory.
Depending on the workload, you might see fewer allocation events and
therefore less frequent reclaim.

It's all too hand-wavy at this stage.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 12:07 ` Kiryl Shutsemau
@ 2026-02-20 16:30 ` David Hildenbrand (Arm)
2026-02-20 19:33 ` Kalesh Singh
0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-20 16:30 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif
On 2/20/26 13:07, Kiryl Shutsemau wrote:
> On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> Just to clarify, do you want it to be enforced on the userspace ABI?
>>> Like, all mappings are 64k aligned?
>>
>> Right, see the proposal from Dev on the list.
>>
>> From the user-space POV, the page size would be 64K for these emulated processes.
>> That is, VMAs must be suitably aligned, etc.
>
> Well, it will drastically limit the adoption. We have too much legacy
> stuff on x86.
I'd assume that many applications nowadays can deal with differing page
sizes (thanks to some other architectures paving the way).
But yes, some real legacy stuff, or stuff that only ever cared about
Intel, still hardcodes pagesize=4k.
In Meta's fleet, it would be quite interesting to see how much
conversion would have to be done.
For legacy apps, you could still run them as 4k pagesize on the same
system, of course.
>
>>>
>>> Waste of memory for page tables is solvable and pretty straightforward.
>>> Most of such cases can be solved mechanically by switching to slab.
>>
>> Well, yes, like Willy says, there are already similar custom solutions for
>> s390x and ppc.
>>
>> Pasha talked recently about the memory waste of 16k kernel stacks and how we
>> would want to reduce that to 4k. In your proposal, it would be 64k, unless
>> you somehow manage to allocate multiple kernel stacks from the same 64k
>> page. My head hurts thinking about whether that could work, maybe it could
>> (no idea about guard pages in there, though).
>
> Kernel stacks are allocated from vmalloc. I think mapping them with
> sub-page granularity should be doable.
I still have to wrap my head around the sub-page mapping here as well.
It's scary.
Re mapcount: I think if any part of the page is mapped, it would be
considered mapped -> mapcount += 1.
>
> BTW, do you see any reason why a slab-allocated stack wouldn't work for
> large base page sizes? There's no requirement for it to be aligned to a
> page or PTE, right?
I'd assume that would work. The devil is in the details with these
things before we have memdescs.
E.g., page tables have a dedicated type (PGTY_table) and store separate
metadata in the ptdesc. For kernel stacks there was once a proposal to
have a type, but it is not upstream.
>
>> [...]
>>
>> IIUC, what you are proposing feels a bit like traveling back in time when it
>> comes to the memory waste problem that Arm users encountered.
>>
>> Where do you see the big difference to 64k on Arm in your proposal? Would
>> you currently also be running 64k Arm in production, and is the memory
>> waste etc acceptable?
>
> That's the point. I don't see a big difference to 64k Arm. I want to
> bring this option to x86: at some machine size, it makes sense to trade
> memory consumption for scalability. I am targeting it at machines with
> over 2TiB of RAM.
>
> BTW, we do run 64k Arm in our fleet. There are some growing pains, but it
> looks good in general. We have no plans to switch to 4k (or 16k) at the
> moment. 512M THPs also look good on some workloads.
Okay, that's valuable information, thanks!
Being able to remove the sub-page mapping part (or being able to just
hide it somewhere deep down in arch code) would make this a lot easier
to digest.
--
Cheers,
David
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 12:10 ` Kiryl Shutsemau
@ 2026-02-20 19:21 ` Kalesh Singh
0 siblings, 0 replies; 33+ messages in thread
From: Kalesh Singh @ 2026-02-20 19:21 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: David Hildenbrand (Arm),
lsf-pc, linux-mm, x86, linux-kernel, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Wilcox,
Johannes Weiner, Usama Arif, android-mm, Adrian Barnaś,
Mateusz Maćkowski, Steven Moreland
On Fri, Feb 20, 2026 at 4:10 AM Kiryl Shutsemau <kas@kernel.org> wrote:
>
> On Thu, Feb 19, 2026 at 03:24:37PM -0800, Kalesh Singh wrote:
> > On Thu, Feb 19, 2026 at 7:39 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> > >
> > > On 2/19/26 16:08, Kiryl Shutsemau wrote:
> > > > [... full proposal snipped ...]
> > >
> > > When discussing per-process page sizes with Ryan and Dev, I mentioned
> > > that having a larger emulated page size could be interesting for other
> > > architectures as well.
> > >
> > > That is, we would emulate a 64K page size on Intel for user space as
> > > well, but let the OS work with 4K pages.
> > >
> > > We'd only allocate+map large folios into user space + pagecache, but
> > > still allow for page tables etc. to not waste memory.
> > >
> > > So "most" of your allocations in the system would actually be at least
> > > 64k, reducing zone lock contention etc.
> > >
> > >
> > > It doesn't solve all the problems you wanted to tackle on your list
> > > (e.g., "struct page" overhead, which will be sorted out by memdescs).
> >
> > Hi Kiryl,
> >
> > I'd be interested to discuss this at LSFMM.
> >
> > On Android, we have a separate but related use case: we emulate the
> > userspace page size on x86, primarily to enable app developers to
> > conduct compatibility testing of their apps for 16KB Android devices.
> > [1]
> >
> > It mainly works by enforcing a larger granularity on the VMAs to
> > emulate a userspace page size, somewhat similar to what David
> > mentioned, while the underlying kernel still operates on a 4KB
> > granularity. [2]
> >
> > IIUC the current design would not enfore the larger granularity /
> > alignment for VMAs to avoid breaking ABI. However, I'd be interest to
> > discuss whether it can be extended to cover this usecase as well.
>
> I don't want to break ABI, but might add a knob (maybe personality(2) ?)
> for enforcement to see what breaks.
I think personality(2) may be too late? By the time a process invokes
it, the initial userspace mappings (executable, linker for init, etc)
are already established with the default granularity.
To handle this, I've been using an early_param to enforce the larger
VMA alignment system-wide right from boot.
Perhaps something for global enforcement (Kconfig/early_param) and a
prctl/personality flag for per-process opt-in?
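A rough sketch of the boot-time side (names made up):

	static unsigned int emulated_page_shift __ro_after_init = PAGE_SHIFT;

	static int __init parse_page_shift(char *str)
	{
		unsigned int shift;

		if (kstrtouint(str, 0, &shift))
			return -EINVAL;
		/* e.g. 14 for 16k, 16 for 64k */
		if (shift >= PAGE_SHIFT && shift <= 16)
			emulated_page_shift = shift;
		return 0;
	}
	early_param("page_shift", parse_page_shift);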
>
> In general, I would prefer to advertise a new value to userspace that
> would mean preferred virtual address space granularity.
This makes sense for maintaining ABI compatibility. Userspace
allocators might want to optimize their layouts to match PG_SIZE while
still being able to operate at PTE_SIZE when needed.
-- Kalesh
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
2026-02-20 16:30 ` David Hildenbrand (Arm)
@ 2026-02-20 19:33 ` Kalesh Singh
0 siblings, 0 replies; 33+ messages in thread
From: Kalesh Singh @ 2026-02-20 19:33 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Kiryl Shutsemau, lsf-pc, linux-mm, x86, linux-kernel,
Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Matthew Wilcox, Johannes Weiner, Usama Arif
On Fri, Feb 20, 2026 at 8:30 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/20/26 13:07, Kiryl Shutsemau wrote:
> > On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
> >> [...]
> >
> > Well, it will drastically limit the adoption. We have too much legacy
> > stuff on x86.
>
> I'd assume that many applications nowadays can deal with differing page
> sizes (thanks to some other architectures paving the way).
>
> But yes, some real legacy stuff, or stuff that only ever cared about
> Intel, still hardcodes pagesize=4k.
I think most issues will stem from linkers setting the default ELF
segment alignment (max-page-size) for x86 to 4096, so those ELFs will
not load correctly, or at all, with a larger emulated granularity.
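(FWIW, ELFs built with something like "-Wl,-z,max-page-size=0x10000"
should remain loadable at up to 64k granularity.)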
-- Kalesh
>
> In Meta's fleet, I'd be quite interesting how much conversion there
> would have to be done.
>
> For legacy apps, you could still run them as 4k pagesize on the same
> system, of course.
>
> >
> >>>
> >>> Waste of memory for page table is solvable and pretty straight forward.
> >>> Most of such cases can be solve mechanically by switching to slab.
> >>
> >> Well, yes, like Willy says, there are already similar custom solutions for
> >> s390x and ppc.
> >>
> >> Pasha talked recently about the memory waste of 16k kernel stacks and how we
> >> would want to reduce that to 4k. In your proposal, it would be 64k, unless
> >> you somehow manage to allocate multiple kernel stacks from the same 64k
> >> page. My head hurts thinking about whether that could work, maybe it could
> >> (no idea about guard pages in there, though).
> >
> > Kernel stack is allocated from vmalloc. I think mapping them with
> > sub-page granularity should be doable.
>
> I still have to wrap my head around the sub-page mapping here as well.
> It's scary.
>
> Re mapcount: I think if any part of the page is mapped, it would be
> considered mapped -> mapcount += 1.
>
> >
> > BTW, do you see any reason why slab-allocated stack wouldn't work for
> > large base page sizes? There's no requirement for it be aligned to page
> > or PTE, right?
>
> I'd assume that would work. The devil is in the details with these
> things before we have memdescs.
>
> E.g., page tables have a dedicated type (PGTY_table) and store separate
> metadata in the ptdesc. For kernel stacks there was once a proposal to
> have a type, but it is not upstream.
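Not an existing implementation, but a minimal sketch of what
slab-backed stacks might look like with the current slab API (objects
kept THREAD_SIZE-aligned for packing, so that, e.g., four 16k stacks
share one 64k page; vmalloc guard pages are lost, as noted earlier in
the thread):

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/thread_info.h>

static struct kmem_cache *stack_cache;

static int __init stack_cache_init(void)
{
	/* Several THREAD_SIZE stacks can share one large order-0 page. */
	stack_cache = kmem_cache_create("kernel_stack", THREAD_SIZE,
					THREAD_SIZE, SLAB_PANIC, NULL);
	return 0;
}

static void *alloc_task_stack(int node)
{
	return kmem_cache_alloc_node(stack_cache, GFP_KERNEL, node);
}

static void free_task_stack(void *stack)
{
	kmem_cache_free(stack_cache, stack);
}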
>
> >
> >> Let's take a look at the history of page size usage on Arm (people can feel
> >> free to correct me):
> >>
> >> (1) Most distros were using 64k on Arm.
> >>
> >> (2) People realized that 64k was suboptimal for many use cases (memory
> >> waste for stacks, pagecache, etc.) and started to switch to 4k. I
> >> remember that mostly HPC-centric users stuck to 64k, but there was
> >> also demand from others to be able to stay on 64k.
> >>
> >> (3) Arm improved performance on a 4k kernel by adding cont-pte support,
> >> trying to get closer to 64k native performance.
> >>
> >> (4) Achieving 64k native performance is hard, which is why per-process
> >> page sizes are being explored to get the best out of both worlds
> >> (use 64k page size only where it really matters for performance).
> >>
> >> Arm clearly has the advantage of actually benefiting from hardware
> >> support for 64k.
> >>
> >> IIUC, what you are proposing feels a bit like traveling back in time when it
> >> comes to the memory waste problem that Arm users encountered.
> >>
> >> Where do you see the big difference to 64k on Arm in your proposal? Would
> >> you currently also be running 64k Arm in production and the memory waste etc
> >> is acceptable?
> >
> > That's the point. I don't see a big difference to 64k Arm. I want to
> > bring this option to x86: at some machine size it makes sense to trade
> > memory consumption for scalability. I am targeting machines with over
> > 2TiB of RAM.
> >
> > BTW, we do run 64k Arm in our fleet. There are some growing pains, but
> > it looks good in general. We have no plans to switch to 4k (or 16k) at
> > the moment. 512M THPs also look good on some workloads.
>
> Okay, that's valuable information, thanks!
>
> Being able to remove the sub-page mapping part (or being able to just
> hide it somewhere deep down in arch code) would make this a lot easier
> to digest.
>
> --
> Cheers,
>
> David
>
end of thread (newest: 2026-02-20 19:33 UTC)
Thread overview: 33+ messages
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:20 ` Peter Zijlstra
2026-02-19 15:27 ` Kiryl Shutsemau
2026-02-19 15:33 ` Pedro Falcato
2026-02-19 15:50 ` Kiryl Shutsemau
2026-02-19 15:53 ` David Hildenbrand (Arm)
2026-02-19 19:31 ` Pedro Falcato
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54 ` Kiryl Shutsemau
2026-02-19 16:09 ` David Hildenbrand (Arm)
2026-02-20 2:55 ` Zi Yan
2026-02-19 17:09 ` Kiryl Shutsemau
2026-02-20 10:24 ` David Hildenbrand (Arm)
2026-02-20 12:07 ` Kiryl Shutsemau
2026-02-20 16:30 ` David Hildenbrand (Arm)
2026-02-20 19:33 ` Kalesh Singh
2026-02-19 23:24 ` Kalesh Singh
2026-02-20 12:10 ` Kiryl Shutsemau
2026-02-20 19:21 ` Kalesh Singh
2026-02-19 17:08 ` Dave Hansen
2026-02-19 22:05 ` Kiryl Shutsemau
2026-02-20 3:28 ` Liam R. Howlett
2026-02-20 12:33 ` Kiryl Shutsemau
2026-02-20 15:17 ` Liam R. Howlett
2026-02-20 15:50 ` Kiryl Shutsemau
2026-02-19 17:30 ` Dave Hansen
2026-02-19 22:14 ` Kiryl Shutsemau
2026-02-19 22:21 ` Dave Hansen
2026-02-19 17:47 ` Matthew Wilcox
2026-02-19 22:26 ` Kiryl Shutsemau
2026-02-20 9:04 ` David Laight
2026-02-20 12:12 ` Kiryl Shutsemau