linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Mapping text with large folios
@ 2025-03-19 15:38 Ryan Roberts
  2025-03-19 18:16 ` Yang Shi
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-03-19 15:38 UTC (permalink / raw)
  To: lsf-pc, Linux-MM; +Cc: Matthew Wilcox, Dave Chinner, Barry Song

Hi All,

I know this is very last minute, but I was hoping that it might be possible to
squeeze in a session to discuss the following?

Summary/Background:

On arm64, physically contiguous and naturally aligned regions can take advantage
of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
regions containing text, current readahead behaviour often yields small,
misaligned folios, preventing this optimization. This proposal introduces a
special-case path for executable mappings, performing synchronous reads of an
architecture-chosen size into large folios (64 KB on arm64). Early performance
tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
gains.

I’ve previously posted attempts to enable this performance improvement ([1],
[2]), but there were objections and conversation fizzled out. Now that I have
more compelling performance data, I’m hoping there is now stronger
justification, and we can find a path forwards.

What I’d Like to Cover:

 - Describe how text memory should ideally be mapped and why it benefits
   performance.

 - Brief review of performance data.

 - Discuss options for the best way to encourage text into large folios:
     - Let the architecture request a preferred size
     - Extend VMA attributes to include preferred THP size hint
     - Provide a sysfs knob
     - Plug into the “mapping min folio order” infrastructure
     - Other approaches?

[1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 15:38 [LSF/MM/BPF TOPIC] Mapping text with large folios Ryan Roberts
@ 2025-03-19 18:16 ` Yang Shi
  2025-03-19 20:38   ` Dave Chinner
  2025-03-19 20:47 ` Barry Song
  2025-04-01 10:53 ` Ryan Roberts
  2 siblings, 1 reply; 13+ messages in thread
From: Yang Shi @ 2025-03-19 18:16 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: lsf-pc, Linux-MM, Matthew Wilcox, Dave Chinner, Barry Song

On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> I know this is very last minute, but I was hoping that it might be possible to
> squeeze in a session to discuss the following?
>
> Summary/Background:
>
> On arm64, physically contiguous and naturally aligned regions can take advantage
> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> regions containing text, current readahead behaviour often yields small,
> misaligned folios, preventing this optimization. This proposal introduces a
> special-case path for executable mappings, performing synchronous reads of an
> architecture-chosen size into large folios (64 KB on arm64). Early performance
> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> gains.

AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
adding to the tests.

>
> I’ve previously posted attempts to enable this performance improvement ([1],
> [2]), but there were objections and conversation fizzled out. Now that I have
> more compelling performance data, I’m hoping there is now stronger
> justification, and we can find a path forwards.
>
> What I’d Like to Cover:
>
>  - Describe how text memory should ideally be mapped and why it benefits
>    performance.
>
>  - Brief review of performance data.
>
>  - Discuss options for the best way to encourage text into large folios:
>      - Let the architecture request a preferred size
>      - Extend VMA attributes to include preferred THP size hint
>      - Provide a sysfs knob
>      - Plug into the “mapping min folio order” infrastructure
>      - Other approaches?

Did you try LBS? You can have a 64K block size with LBS; it should
create large folios for the page cache, so text should get large folios
automatically (IIRC the arm64 linker script has 64K alignment by default).

Thanks,
Yang

>
> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 18:16 ` Yang Shi
@ 2025-03-19 20:38   ` Dave Chinner
  2025-03-19 22:13     ` Barry Song
  2025-03-20 12:13     ` Ryan Roberts
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2025-03-19 20:38 UTC (permalink / raw)
  To: Yang Shi; +Cc: Ryan Roberts, lsf-pc, Linux-MM, Matthew Wilcox, Barry Song

On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > Hi All,
> >
> > I know this is very last minute, but I was hoping that it might be possible to
> > squeeze in a session to discuss the following?

I'm not going to be at LSFMM, so I'd prefer this sort of thing get
discussed on the dev lists...

> > Summary/Background:
> >
> > On arm64, physically contiguous and naturally aligned regions can take advantage
> > of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> > regions containing text, current readahead behaviour often yields small,
> > misaligned folios, preventing this optimization. This proposal introduces a
> > special-case path for executable mappings, performing synchronous reads of an
> > architecture-chosen size into large folios (64 KB on arm64). Early performance
> > tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> > gains.
> 
> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
> adding to the tests.
> 
> >
> > I’ve previously posted attempts to enable this performance improvement ([1],
> > [2]), but there were objections and conversation fizzled out. Now that I have
> > more compelling performance data, I’m hoping there is now stronger
> > justification, and we can find a path forwards.
> >
> > What I’d Like to Cover:
> >
> >  - Describe how text memory should ideally be mapped and why it benefits
> >    performance.

I think the main people involved already understand this...

> >  - Brief review of performance data.

You don't need to convince me - there's 3 decades of evidence
proving that larger, fewer page table mappings for executables
result in better performance.

> >  - Discuss options for the best way to encourage text into large folios:
> >      - Let the architecture request a preferred size
> >      - Extend VMA attributes to include preferred THP size hint
> >      - Provide a sysfs knob
> >      - Plug into the “mapping min folio order” infrastructure
> >      - Other approaches?

Implement generic large folio/sequential PTE mapping optimisations
for each platform, then control it by letting the filesystem decide
what the desired mapping order and alignment should be for any given
inode mapping tree.

> Did you try LBS? You can have 64K block size with LBS, it should
> create large folios for page cache so text should get large folios
> automatically (IIRC arm64 linker script has 64K alignment by default).

We really don't want people using 64kB block size filesystems for
root filesystems - there are plenty of downsides to using huge block
sizes for filesystems that generally hold many tiny files.

However, I agree with the general principle that the fs should be
directing the inode mapping tree folio order behaviour.  i.e. the
filesystem already sets both the floor and the desired behaviour for
folio instantiation for any given inode mapping tree.

It also needs to be able to instantiate large folios -before- the
executable is mapped into VMAs via mmap() because files can be read
into cache before they are run (e.g. boot time readahead hacks).
i.e. a mmap() time directive is too late to apply to the inode
mapping tree to guarantee optimal layout for PTE optimisation. It
also may not be possible to apply mmap() time directives due to
other filesystem constraints, so mmap() time directives may well end
up being unpredictable and unreliable....

There's also an obvious filesystem level trigger for enabling this
behaviour in a generic manner.  e.g. The filesystem can look at the
X perm bits on the inode at instantiation time and if they are set,
set a "desired order" value+flag on the mapping at inode cache
instantiation in addition to "min order".
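
Roughly something like this at inode instantiation time (sketch only -
mapping_set_desired_order() and arch_exec_folio_order() are invented
names standing in for whatever plumbing would need to be added):

	/* Executable file: ask for the arch's preferred folio order. */
	if (inode->i_mode & S_IXUGO)
		mapping_set_desired_order(inode->i_mapping,
					  arch_exec_folio_order());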

If a desired order is configured, the page cache read code can then
pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
value to folio allocation. If that can't be allocated then it can
fall back to single page folios instead of failing.
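
In the allocation path that would boil down to something like this
(sketch only; "desired_order" stands in for wherever the mapping's
desired order is stored):

	/* Best effort: try the desired order, but degrade to order-0
	 * rather than failing the read.
	 */
	folio = filemap_alloc_folio(gfp | __GFP_NORETRY | __GFP_NOWARN,
				    desired_order);
	if (!folio)
		folio = filemap_alloc_folio(gfp, 0);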

At this point, we will always optimistically try to allocate larger
folios for executables on all architectures. Architectures that
can optimise sequential PTE mappings can then simply add generic
support for large folio optimisation, and more efficient executable
mappings simply fall out of the generic support for efficient
mapping of large folios and filesystems preferring large folios for
executable inode mappings....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 15:38 [LSF/MM/BPF TOPIC] Mapping text with large folios Ryan Roberts
  2025-03-19 18:16 ` Yang Shi
@ 2025-03-19 20:47 ` Barry Song
  2025-03-20 14:57   ` Ryan Roberts
  2025-04-01 10:53 ` Ryan Roberts
  2 siblings, 1 reply; 13+ messages in thread
From: Barry Song @ 2025-03-19 20:47 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: lsf-pc, Linux-MM, Matthew Wilcox, Dave Chinner

On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> I know this is very last minute, but I was hoping that it might be possible to
> squeeze in a session to discuss the following?
>
> Summary/Background:
>
> On arm64, physically contiguous and naturally aligned regions can take advantage
> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> regions containing text, current readahead behaviour often yields small,
> misaligned folios, preventing this optimization. This proposal introduces a
> special-case path for executable mappings, performing synchronous reads of an
> architecture-chosen size into large folios (64 KB on arm64). Early performance
> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> gains.
>
> I’ve previously posted attempts to enable this performance improvement ([1],
> [2]), but there were objections and conversation fizzled out. Now that I have
> more compelling performance data, I’m hoping there is now stronger
> justification, and we can find a path forwards.
>
> What I’d Like to Cover:
>
>  - Describe how text memory should ideally be mapped and why it benefits
>    performance.
>
>  - Brief review of performance data.
>
>  - Discuss options for the best way to encourage text into large folios:
>      - Let the architecture request a preferred size
>      - Extend VMA attributes to include preferred THP size hint

We might need this for a couple of other cases.

1. The native heap (for example, jemalloc) can configure its base
"granularity" and then use MADV_DONTNEED/FREE at that granularity to
manage memory. Currently, the default granularity is PAGE_SIZE, which can
lead to excessive folio splitting. For instance, if we set jemalloc's
granularity to 16KB while sysfs supports 16KB, 32KB, 64KB, etc., splitting
can still occur. Therefore, in some cases, I believe the kernel should be
aware of how userspace is managing memory (see the sketch below, after
case 2).

2. Java heap GC compaction - userfaultfd_move() things. I am considering
adding support for batched PTE/folio moves in userfaultfd_move(). If
sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but the userspace Java heap
moves memory at a 16KB granularity, it could lead to excessive folio
splitting.
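
For case 1, what I mean by "granularity" is roughly this (illustrative
userspace sketch; the sizes and names are just examples):

	#include <sys/mman.h>

	#define HEAP_GRANULE	(16 * 1024)	/* heap run size */

	/*
	 * Return a fully-freed, granule-aligned run to the kernel. If the
	 * kernel's folio sizes exceed HEAP_GRANULE, each call like this
	 * can force a folio split.
	 */
	static void release_run(void *run)
	{
		madvise(run, HEAP_GRANULE, MADV_DONTNEED);
	}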

For exec, it seems we need a userspace-transparent approach. Asking each
application to modify its code to madvise the kernel on its preferred exec folio
size seems cumbersome.

I mean, we could whitelist all execs by default unless an application explicitly
requests to disable it?

>      - Provide a sysfs knob
>      - Plug into the “mapping min folio order” infrastructure
>      - Other approaches?
>
> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan

Thanks
Barry


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 20:38   ` Dave Chinner
@ 2025-03-19 22:13     ` Barry Song
  2025-03-20  0:53       ` Dave Chinner
  2025-03-20 12:16       ` Ryan Roberts
  2025-03-20 12:13     ` Ryan Roberts
  1 sibling, 2 replies; 13+ messages in thread
From: Barry Song @ 2025-03-19 22:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Yang Shi, Ryan Roberts, lsf-pc, Linux-MM, Matthew Wilcox

On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
> > On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > I know this is very last minute, but I was hoping that it might be possible to
> > > squeeze in a session to discuss the following?
>
> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
> discussed on the dev lists...
>
> > > Summary/Background:
> > >
> > > On arm64, physically contiguous and naturally aligned regions can take advantage
> > > of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> > > regions containing text, current readahead behaviour often yields small,
> > > misaligned folios, preventing this optimization. This proposal introduces a
> > > special-case path for executable mappings, performing synchronous reads of an
> > > architecture-chosen size into large folios (64 KB on arm64). Early performance
> > > tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> > > gains.
> >
> > AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
> > adding to the tests.
> >
> > >
> > > I’ve previously posted attempts to enable this performance improvement ([1],
> > > [2]), but there were objections and conversation fizzled out. Now that I have
> > > more compelling performance data, I’m hoping there is now stronger
> > > justification, and we can find a path forwards.
> > >
> > > What I’d Like to Cover:
> > >
> > >  - Describe how text memory should ideally be mapped and why it benefits
> > >    performance.
>
> I think the main people involved already understand this...
>
> > >  - Brief review of performance data.
>
> You don't need to convince me - there's 3 decades of evidence
> proving that larger, fewer page table mappings for executables
> results in better performance.
>
> > >  - Discuss options for the best way to encourage text into large folios:
> > >      - Let the architecture request a preferred size
> > >      - Extend VMA attributes to include preferred THP size hint
> > >      - Provide a sysfs knob
> > >      - Plug into the “mapping min folio order” infrastructure
> > >      - Other approaches?
>
> Implement generic large folio/sequential PTE mapping optimisations
> for each platform, then control it by letting the filesystem decide
> what the desired mapping order and alignment should be for any given
> inode mapping tree.
>
> > Did you try LBS? You can have 64K block size with LBS, it should
> > create large folios for page cache so text should get large folios
> > automatically (IIRC arm64 linker script has 64K alignment by default).
>
> We really don't want people using 64kB block size filesystems for
> root filesystems - there are plenty of downsides to using huge block
> sizes for filesytems that generally hold many tiny files.

Agreed. Large folios are compatible with existing filesystems and
applications, and don't always require userspace to adopt them.

>
> However, I agree with the general principle that the fs should be
> directing the inode mapping tree folio order behaviour.  i.e. the
> filesystem already sets both the floor and the desired behaviour for
> folio instantiation for any given inode mapping tree.
>
> It also needs to be able to instantiate large folios -before- the
> executable is mapped into VMAs via mmap() because files can be read
> into cache before they are run (e.g. boot time readahead hacks).
> i.e. a mmap() time directive is too late to apply to the inode
> mapping tree to guarantee optimal layout for PTE optimisation. It
> also may not be possible to apply mmap() time directives due to
> other filesystem constraints, so mmap() time directives may well end
> up being unpredictable and unreliable....
>

ELF loading and the linker may trigger readahead of a small portion
of the code text before mmap(). However, once the executable files
are large, the minor loss of large folios due to limited read-ahead of
the text may not be substantial enough to justify consideration.

But "boot time readahead hacks" seem like something that can read
ahead significantly. Unless we can modify these "boot time readahead
hacks" to use mmap() with EXEC mapping, it seems we would need
something at the sys_read() to apply the preferred size.

> There's also an obvious filesystem level trigger for enabling this
> behaviour in a generic manner.  e.g. The filesystem can look at the
> X perm bits on the inode at instantiation time and if they are set,
> set a "desired order" value+flag on the mapping at inode cache
> instantiation in addition to "min order".
>

Not sure what proportion of an executable file is the text section. If it's
less than 30% or 50%, it seems we might be allocating "preferred size"
large folios to many other sections that may not benefit from them?

Also, a Bash shell script with executable permissions might get a
preferred large folio size. This seems weird?

By the way, are .so files executable files, even though they may contain
a lot of code? Checking my filesystems, it seems not:

/usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
-rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13


> If a desired order is configured, the page cache read code can then
> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
> value to folio allocation. If that can't be allocated then it can
> fall back to single page folios instead of failing.
>
> At this point, we will always optimistically try to allocate larger
> folios for executables on all architectures. Architectures that
> can optimise sequential PTE mappings can then simply add generic
> support for large folio optimisation, and more efficient executable
> mappings simply fall out of the generic support for efficient
> mapping of large folios and filesystems preferring large folios for
> executable inode mappings....

I feel this falls more within the scope of architecture and memory
management rather than the filesystem. If possible, we should try
to avoid modifying the filesystem code?

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Thanks
Barry


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 22:13     ` Barry Song
@ 2025-03-20  0:53       ` Dave Chinner
  2025-03-20 14:47         ` Ryan Roberts
  2025-03-20 12:16       ` Ryan Roberts
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-03-20  0:53 UTC (permalink / raw)
  To: Barry Song; +Cc: Yang Shi, Ryan Roberts, lsf-pc, Linux-MM, Matthew Wilcox

On Thu, Mar 20, 2025 at 11:13:11AM +1300, Barry Song wrote:
> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
> > > On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > However, I agree with the general principle that the fs should be
> > directing the inode mapping tree folio order behaviour.  i.e. the
> > filesystem already sets both the floor and the desired behaviour for
> > folio instantiation for any given inode mapping tree.
> >
> > It also needs to be able to instantiate large folios -before- the
> > executable is mapped into VMAs via mmap() because files can be read
> > into cache before they are run (e.g. boot time readahead hacks).
> > i.e. a mmap() time directive is too late to apply to the inode
> > mapping tree to guarantee optimal layout for PTE optimisation. It
> > also may not be possible to apply mmap() time directives due to
> > other filesystem constraints, so mmap() time directives may well end
> > up being unpredictable and unreliable....
> >
> 
> ELF loading and the linker may lead to readaheading a small portion
> of the code text before mmap(). However, once the executable files
> are large, the minor loss of large folios due to limited read-ahead of
> the text may not be substantial enough to justify consideration.
> 
> But "boot time readahead hacks" seem like something that can read
> ahead significantly. Unless we can modify these "boot time readahead
> hacks" to use mmap() with EXEC mapping, it seems we would need
> something at the sys_read() to apply the preferred size.

Yes, that's exactly what I said. :)

But you haven't understood the example I gave (i.e. boot time
readahead). There are many ways to have executables cached without them
being mapped executable. They get accessed by a linker during
compilation of code. They get updated by the OS package manager.
A backup or deduplication program accesses them. A virus scanner
reads them looking for trojans, etc.

i.e. there are lots of ways of getting executables cached that
prevent optimal large folio formation if the filesystem doesn't
directly control formation of said large folios.

Hence if we don't apply large folio selection criteria to -all-
buffered IO (read, write and mmap), the result when mmap(EXEC)
occurs is going to be .... unpredictable and not always optimal.

So assuming that the cache is cold, we want filemap_fault() to
allocate large folios from cache misses on read faults, yes?

That lands us in do_sync_mmap_readahead(), and that has a bit of a
problem w.r.t. large folios. It ends up calling:

	page_cache_ra_order(.... new_order = 0)

This limits folios allocated by readahead to order-2 in size, unless
the mapping was instantiated by the filesystem with a larger
min_order, in which case it will use the larger min_order value.

Either way, we don't get the desired large folio size the arch wants
to optimise the page table mappings.

I'd suggest this would be fixed by something like this in
do_sync_mmap_readahead():

-	page_cache_ra_order(..., 0);
+	new_order = 0;
+	if (is_exec_mapping(vmf->vma->vm_flags))
+		new_order = <arch specific optimal pte mapping order>
+	page_cache_ra_order(..., new_order);

And now the page cache will be populated with large folios of at
least the order requested if the filesystem can support folios of that
size.

Unless I've misunderstood something (cold cache instantiation of
64kB folios is what you desired, isn't it?), that small change
should largely make exec mappings behave the way you want...

> > There's also an obvious filesystem level trigger for enabling this
> > behaviour in a generic manner.  e.g. The filesystem can look at the
> > X perm bits on the inode at instantiation time and if they are set,
> > set a "desired order" value+flag on the mapping at inode cache
> > instantiation in addition to "min order".
> >
> 
> Not sure what proportion of an executable file is the text section. If it's
> less than 30% or 50%, it seems we might be allocating "preferred size"
> large folios to many other sections that may not benefit from them?
>
> Also, a Bash shell script with executable permissions might get a
> preferred large folio size. This seems weird?

But none of this is actually a problem at all.  Fewer, larger folios
still mean less page cache and memory reclaim management overhead
even if there is no direct benefit from optimised page table
mapping.

Also, we typically know the file size at mapping tree instantiation
time and hence we could make a sane decision as to whether large
folios should be used for any specific executable file.

> By the way, are .so files executable files, even though they may contain
> a lot of code? As I check my filesystems, it seems not:
> 
> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
> -rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13

True, I hadn't considered that.

Seems like fixing do_sync_mmap_readahead() might be the best way to
go then....

> > If a desired order is configured, the page cache read code can then
> > pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
> > value to folio allocation. If that can't be allocated then it can
> > fall back to single page folios instead of failing.
> >
> > At this point, we will always optimistically try to allocate larger
> > folios for executables on all architectures. Architectures that
> > can optimise sequential PTE mappings can then simply add generic
> > support for large folio optimisation, and more efficient executable
> > mappings simply fall out of the generic support for efficient
> > mapping of large folios and filesystems preferring large folios for
> > executable inode mappings....
> 
> I feel this falls more within the scope of architecture and memory
> management rather than the filesystem. If possible, we should try
> to avoid modifying the filesystem code?

Large folios may be an MM construct, but you can't use them
in the page cache without the backing filesystem being fully aware
of them, and the mm subsystem has to work within the constraints the
filesystem places on large folios in the page cache.

If we need to change constraints or enact new policies around
file IO specific large folio optimisations, then we definitely are
going to need to modify both mm and filesystem code to implement
them....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 20:38   ` Dave Chinner
  2025-03-19 22:13     ` Barry Song
@ 2025-03-20 12:13     ` Ryan Roberts
  1 sibling, 0 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-03-20 12:13 UTC (permalink / raw)
  To: Dave Chinner, Yang Shi; +Cc: lsf-pc, Linux-MM, Matthew Wilcox, Barry Song

On 19/03/2025 20:38, Dave Chinner wrote:
> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Hi All,
>>>
>>> I know this is very last minute, but I was hoping that it might be possible to
>>> squeeze in a session to discuss the following?
> 
> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
> discussed on the dev lists...

I'd be happy to do it that way. Except it was you that raised the objections to
the original patch, then didn't engage with my responses [1]. So I was trying to
force the issue :)

[1] https://lore.kernel.org/all/bdde4008-60db-4717-a6b5-53d77ab76bdb@arm.com/

> 
>>> Summary/Background:
>>>
>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>> regions containing text, current readahead behaviour often yields small,
>>> misaligned folios, preventing this optimization. This proposal introduces a
>>> special-case path for executable mappings, performing synchronous reads of an
>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>> gains.
>>
>> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
>> adding to the tests.
>>
>>>
>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>> more compelling performance data, I’m hoping there is now stronger
>>> justification, and we can find a path forwards.
>>>
>>> What I’d Like to Cover:
>>>
>>>  - Describe how text memory should ideally be mapped and why it benefits
>>>    performance.
> 
> I think the main people involved already understand this...
> 
>>>  - Brief review of performance data.
> 
> You don't need to convince me - there's 3 decades of evidence
> proving that larger, fewer page table mappings for executables
> results in better performance.

Sure, I was just trying to set the scene; I think it's worth 1 slide...

> 
>>>  - Discuss options for the best way to encourage text into large folios:
>>>      - Let the architecture request a preferred size
>>>      - Extend VMA attributes to include preferred THP size hint
>>>      - Provide a sysfs knob
>>>      - Plug into the “mapping min folio order” infrastructure
>>>      - Other approaches?
> 
> Implement generic large folio/sequential PTE mapping optimisations
> for each platform, then control it by letting the filesystem decide
> what the desired mapping order and alignment should be for any given
> inode mapping tree.

I don't really understand what this has to do with the filesystem? The
filesystem provides a hard floor for the *permitted* folio size (to satisfy
BS>PS constraints). But it's the readahead that decides the actual folio sizes,
subject to meeting that constraint.

An ELF has multiple sections so setting a particular minimum folio size for the
file doesn't seem appropriate. And additionally for my use case, there is no
hard requirement for a minumum folio size; it's just a preference. We can safely
fall back to the file's minimum folio size if allocation of the preferred folio
size fails or if it would run off the end of the file, etc.

I suspect I've misunderstood your proposal, because the way I've interpreted
it, it makes no sense to me...

> 
>> Did you try LBS? You can have 64K block size with LBS, it should
>> create large folios for page cache so text should get large folios
>> automatically (IIRC arm64 linker script has 64K alignment by default).
> 
> We really don't want people using 64kB block size filesystems for
> root filesystems - there are plenty of downsides to using huge block
> sizes for filesytems that generally hold many tiny files.
> 
> However, I agree with the general principle that the fs should be
> directing the inode mapping tree folio order behaviour.  i.e. the
> filesystem already sets both the floor and the desired behaviour for
> folio instantiation for any given inode mapping tree.
> 
> It also needs to be able to instantiate large folios -before- the
> executable is mapped into VMAs via mmap() because files can be read
> into cache before they are run (e.g. boot time readahead hacks).
> i.e. a mmap() time directive is too late to apply to the inode
> mapping tree to guarantee optimal layout for PTE optimisation. It
> also may not be possible to apply mmap() time directives due to
> other filesystem constraints, so mmap() time directives may well end
> up being unpredictable and unreliable....

Agreed on this. A common manifestation of the issue is when user space
read()s the ELF header to figure out how to mmap it. The read() causes readahead
of multiple pages which end up in the page cache as (commonly) 16K folios, which
often overlap into the text section, so after the text section gets mmaped, the
folios already in the page cache will be faulted into the process as is.

We have separately been exploring the possibility of modifying the readahead()
syscall behavior; if user space is asking to readahead a large chunk, it makes
sense to use that as a hint that the region should be treated as a single object
and be read into the largest possible folios. Today if some of the requested
region is already in the page cache, readahead will only read the bits not
present. But it might be preferable to just drop the bits that are present and
re-read into large folios.

Of course you wouldn't want user space to issue readahead() calls for the
entirety of the text section. But if the binary were post-linked with BOLT or
some PGO solution that puts the hot code at the front of the section, the linker
could detect this and request readahead for the hot part.
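
Something like this from the dynamic linker, purely as a sketch (the
offset/length of the hot region would come from wherever BOLT/PGO records
its layout; none of this is existing loader code):

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Hint the kernel to pull the hot text into the page cache in one
	 * large request, so readahead can use big folios up front.
	 */
	static void prefetch_hot_text(int fd, off_t text_off, size_t hot_len)
	{
		readahead(fd, text_off, hot_len);	/* best effort */
	}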

But independent of the readahead() stuff, VM_EXEC is a good enough indicator to
control this best-effort feature in my view; it is sufficient most of the time.
And indeed there is already precedent, because readahead consumes MADV_HUGEPAGE
in exactly the same way.

> 
> There's also an obvious filesystem level trigger for enabling this
> behaviour in a generic manner.  e.g. The filesystem can look at the
> X perm bits on the inode at instantiation time and if they are set,
> set a "desired order" value+flag on the mapping at inode cache
> instantiation in addition to "min order".

My understanding is that the X permission only controls whether the kernel will
permit exec()ing the file. It doesn't prevent it from being mapped executable in
a process. And shared libraries usually don't set the X perm. So I'm not sure
this works.

> 
> If a desired order is configured, the page cache read code can then
> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
> value to folio allocation. If that can't be allocated then it can
> fall back to single page folios instead of failing.

I don't see FGP_TRY_ORDER in the source; is that new, or are you proposing it as
an addition? I guess this would mainly just disable reclaim? I agree that this
wants to be a best-effort allocation. I'm just disagreeing that we want to
direct the policy from the filesystem; why would we want to have to implement
the policy for all filesystems?

> 
> At this point, we will always optimistically try to allocate larger
> folios for executables on all architectures. Architectures that
> can optimise sequential PTE mappings can then simply add generic
> support for large folio optimisation, and more efficient executable
> mappings simply fall out of the generic support for efficient
> mapping of large folios and filesystems preferring large folios for
> executable inode mappings....

arm64 already has the large folio mapping optimizations. It's called "contpte";
it opportunistically sets the contiguous bit in the block of PTEs if the folio
size and alignment are acceptable.

It sounds to me like we agree on most of this, but disagree on where the policy
should be directed and based on what heuristic; filesystem + X perm bit, or
readahead + VM_EXEC bit.

Thanks,
Ryan

> 
> -Dave.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 22:13     ` Barry Song
  2025-03-20  0:53       ` Dave Chinner
@ 2025-03-20 12:16       ` Ryan Roberts
  1 sibling, 0 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-03-20 12:16 UTC (permalink / raw)
  To: Barry Song, Dave Chinner; +Cc: Yang Shi, lsf-pc, Linux-MM, Matthew Wilcox

Apologies, I just sent a response to Dave that raises most of the same points
that Barry raises here. I'll read the full thread before replying further :)


On 19/03/2025 22:13, Barry Song wrote:
> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I know this is very last minute, but I was hoping that it might be possible to
>>>> squeeze in a session to discuss the following?
>>
>> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
>> discussed on the dev lists...
>>
>>>> Summary/Background:
>>>>
>>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>>> regions containing text, current readahead behaviour often yields small,
>>>> misaligned folios, preventing this optimization. This proposal introduces a
>>>> special-case path for executable mappings, performing synchronous reads of an
>>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>>> gains.
>>>
>>> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
>>> adding to the tests.
>>>
>>>>
>>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>>> more compelling performance data, I’m hoping there is now stronger
>>>> justification, and we can find a path forwards.
>>>>
>>>> What I’d Like to Cover:
>>>>
>>>>  - Describe how text memory should ideally be mapped and why it benefits
>>>>    performance.
>>
>> I think the main people involved already understand this...
>>
>>>>  - Brief review of performance data.
>>
>> You don't need to convince me - there's 3 decades of evidence
>> proving that larger, fewer page table mappings for executables
>> results in better performance.
>>
>>>>  - Discuss options for the best way to encourage text into large folios:
>>>>      - Let the architecture request a preferred size
>>>>      - Extend VMA attributes to include preferred THP size hint
>>>>      - Provide a sysfs knob
>>>>      - Plug into the “mapping min folio order” infrastructure
>>>>      - Other approaches?
>>
>> Implement generic large folio/sequential PTE mapping optimisations
>> for each platform, then control it by letting the filesystem decide
>> what the desired mapping order and alignment should be for any given
>> inode mapping tree.
>>
>>> Did you try LBS? You can have 64K block size with LBS, it should
>>> create large folios for page cache so text should get large folios
>>> automatically (IIRC arm64 linker script has 64K alignment by default).
>>
>> We really don't want people using 64kB block size filesystems for
>> root filesystems - there are plenty of downsides to using huge block
>> sizes for filesytems that generally hold many tiny files.
> 
> Agreed. Large folios will be compatible with existing file systems and
> applications, which don’t always require userspace to adopt them.
> 
>>
>> However, I agree with the general principle that the fs should be
>> directing the inode mapping tree folio order behaviour.  i.e. the
>> filesystem already sets both the floor and the desired behaviour for
>> folio instantiation for any given inode mapping tree.
>>
>> It also needs to be able to instantiate large folios -before- the
>> executable is mapped into VMAs via mmap() because files can be read
>> into cache before they are run (e.g. boot time readahead hacks).
>> i.e. a mmap() time directive is too late to apply to the inode
>> mapping tree to guarantee optimal layout for PTE optimisation. It
>> also may not be possible to apply mmap() time directives due to
>> other filesystem constraints, so mmap() time directives may well end
>> up being unpredictable and unreliable....
>>
> 
> ELF loading and the linker may lead to readaheading a small portion
> of the code text before mmap(). However, once the executable files
> are large, the minor loss of large folios due to limited read-ahead of
> the text may not be substantial enough to justify consideration.
> 
> But "boot time readahead hacks" seem like something that can read
> ahead significantly. Unless we can modify these "boot time readahead
> hacks" to use mmap() with EXEC mapping, it seems we would need
> something at the sys_read() to apply the preferred size.
> 
>> There's also an obvious filesystem level trigger for enabling this
>> behaviour in a generic manner.  e.g. The filesystem can look at the
>> X perm bits on the inode at instantiation time and if they are set,
>> set a "desired order" value+flag on the mapping at inode cache
>> instantiation in addition to "min order".
>>
> 
> Not sure what proportion of an executable file is the text section. If it's
> less than 30% or 50%, it seems we might be allocating "preferred size"
> large folios to many other sections that may not benefit from them?
> 
> Also, a Bash shell script with executable permissions might get a
> preferred large folio size. This seems weird?
> 
> By the way, are .so files executable files, even though they may contain
> a lot of code? As I check my filesystems, it seems not:
> 
> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
> -rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13
> 
> 
>> If a desired order is configured, the page cache read code can then
>> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
>> value to folio allocation. If that can't be allocated then it can
>> fall back to single page folios instead of failing.
>>
>> At this point, we will always optimistically try to allocate larger
>> folios for executables on all architectures. Architectures that
>> can optimise sequential PTE mappings can then simply add generic
>> support for large folio optimisation, and more efficient executable
>> mappings simply fall out of the generic support for efficient
>> mapping of large folios and filesystems preferring large folios for
>> executable inode mappings....
> 
> I feel this falls more within the scope of architecture and memory
> management rather than the filesystem. If possible, we should try
> to avoid modifying the filesystem code?
> 
>>
>> -Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
> 
> Thanks
> Barry



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-20  0:53       ` Dave Chinner
@ 2025-03-20 14:47         ` Ryan Roberts
  0 siblings, 0 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-03-20 14:47 UTC (permalink / raw)
  To: Dave Chinner, Barry Song; +Cc: Yang Shi, lsf-pc, Linux-MM, Matthew Wilcox

On 20/03/2025 00:53, Dave Chinner wrote:
> On Thu, Mar 20, 2025 at 11:13:11AM +1300, Barry Song wrote:
>> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@fromorbit.com> wrote:
>>> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>>>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>> However, I agree with the general principle that the fs should be
>>> directing the inode mapping tree folio order behaviour.  i.e. the
>>> filesystem already sets both the floor and the desired behaviour for
>>> folio instantiation for any given inode mapping tree.
>>>
>>> It also needs to be able to instantiate large folios -before- the
>>> executable is mapped into VMAs via mmap() because files can be read
>>> into cache before they are run (e.g. boot time readahead hacks).
>>> i.e. a mmap() time directive is too late to apply to the inode
>>> mapping tree to guarantee optimal layout for PTE optimisation. It
>>> also may not be possible to apply mmap() time directives due to
>>> other filesystem constraints, so mmap() time directives may well end
>>> up being unpredictable and unreliable....
>>>
>>
>> ELF loading and the linker may lead to readaheading a small portion
>> of the code text before mmap(). However, once the executable files
>> are large, the minor loss of large folios due to limited read-ahead of
>> the text may not be substantial enough to justify consideration.
>>
>> But "boot time readahead hacks" seem like something that can read
>> ahead significantly. Unless we can modify these "boot time readahead
>> hacks" to use mmap() with EXEC mapping, it seems we would need
>> something at the sys_read() to apply the preferred size.
> 
> Yes, that's exactly what I said. :)
> 
> But you haven't understood the example I gave (ie.. boot time
> readahead). There are many ways to have executables cached without them
> being mapped executable. They get accessed by a linker during
> compilation of code. They get updated by the OS package manager.
> A backup or deduplication program accesses them. A virus scanner
> reads them looking for trojans, etc.

But most of these other ways are sequentially reading or writing the file, so
readahead will work more or less as expected in these cases and quickly ramp up
to bigger and bigger folios, I think? So most of the file will end up in folios
at least as large as 64K. When mapped, arm64 will be able to set the contpte bit.

In my experience, it's only when we are faulting in memory due to execution that
the pattern becomes random access and readahead never reads ahead far enough to
use larger folios - that's the case that needs help.

> 
> i.e. there are lots of ways of getting executables cached that
> prevent optimal large folio formation if the filesystem doesn't
> directly control formation of said large folios.
> 
> Hence if we don't apply large folio selection criteria to -all-
> buffered IO (read, write and mmap), the result when mmap(EXEC)
> occurs is going to be .... unpredictable and not always optimal.
> 
> So assuming that the cache is cold, we want filemap_fault() to
> allocate large folios from cache misses on read faults, yes?

Large folios of a preferred size, yes.

> 
> That lands us in do_sync_mmap_readahead(), and that has a bit of a
> problem w.r.t. large folios. it ends up calling:
> 
> 	page_cache_ra_order(.... new_order = 0)
> 
> This limits folios allocated by readahead to order-2 in size, unless
> the mapping was instantiated by the filesystem with a larger
> min_order, in which case it will use the larger min_order value.
> 
> Either way, we don't get the desired large folio size the arch wants
> to optimise the page table mappings.
> 
> I'd suggest this would be fixed by something like this in
> do_sync_mmap_readahead():
> 
> -	page_cache_ra_order(..., 0);
> +	new_order = 0;
> +	if (is_exec_mapping(vmf->vma->vm_flags))
> +		new_order = <arch specific optimal pte mapping order>
> +	page_cache_ra_order(..., new_order);

That's pretty much what my first attempt at upstreaming does. It's not quite
that straightforward though, because we also have to modify the readahead sync
and async sizes to read an exact multiple of 64K. Otherwise
page_cache_ra_order() will reduce the order of the folio(s) to fit the requested
data size. The "new_order" is only a target starting point.

My code follows the same pattern already used for MADV_HUGEPAGE mappings in
do_sync_mmap_readahead():

	/*
	 * Allow arch to request a preferred minimum folio order for executable
	 * memory. This can often be beneficial to performance if (e.g.) arm64
	 * can contpte-map the folio. Executable memory rarely benefits from
	 * read-ahead anyway, due to its random access nature.
	 */
	if (vm_flags & VM_EXEC) {
		int order = arch_wants_exec_folio_order();

		if (order >= 0) {
			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
			ra->size = 1UL << order;
			ra->async_size = 0;
			ractl._index &= ~((unsigned long)ra->size - 1);
			page_cache_ra_order(&ractl, ra, order);
			return fpin;
		}
	}

On arm64, this would do a sync 64K read into a 64K folio most of the time.
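
The arm64 side is tiny; as a sketch (not quoting the actual patch), it
essentially just returns the contpte order:

	/* arm64: prefer the contpte block size (64K with 4K base pages). */
	#define arch_wants_exec_folio_order()	(CONT_PTE_SHIFT - PAGE_SHIFT)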

> 
> And now the page cache will be populated with large folios of at
> least the order requested if filesystem can support folios of that
> size.
> 
> Unless I've misunderstood something (cold cache instantiation of
> 64kB folios is what you desired, isn't it?), that small change
> should largely make exec mappings behave the way you want...

So it sounds like you support this proposed approach?

> 
>>> There's also an obvious filesystem level trigger for enabling this
>>> behaviour in a generic manner.  e.g. The filesystem can look at the
>>> X perm bits on the inode at instantiation time and if they are set,
>>> set a "desired order" value+flag on the mapping at inode cache
>>> instantiation in addition to "min order".
>>>
>>
>> Not sure what proportion of an executable file is the text section. If it's
>> less than 30% or 50%, it seems we might be allocating "preferred size"
>> large folios to many other sections that may not benefit from them?
>>
>> Also, a Bash shell script with executable permissions might get a
>> preferred large folio size. This seems weird?
> 
> But none of this is actually a problem at all.  Fewer, larger folios
> still means less page cache and memory reclaim management overhead
> even if there is no direct benefit from optimised page table
> mapping.
> 
> Also, we typically know the file size at mapping tree instantiation
> time and hence we could make a sane decision as to whether large
> folios should be used for any specific executable file.
> 
>> By the way, are .so files executable files, even though they may contain
>> a lot of code? As I check my filesystems, it seems not:
>>
>> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
>> -rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13
> 
> True, I hadn't considered that.
> 
> Seems like fixing do_sync_mmap_readahead() might be the best way to
> go then....

OK sounds like we might be converging :)

Thanks,
Ryan

> 
>>> If a desired order is configured, the page cache read code can then
>>> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
>>> value to folio allocation. If that can't be allocated then it can
>>> fall back to single page folios instead of failing.
>>>
>>> At this point, we will always optimistically try to allocate larger
>>> folios for executables on all architectures. Architectures that
>>> can optimise sequential PTE mappings can then simply add generic
>>> support for large folio optimisation, and more efficient executable
>>> mappings simply fall out of the generic support for efficient
>>> mapping of large folios and filesystems preferring large folios for
>>> executable inode mappings....
>>
>> I feel this falls more within the scope of architecture and memory
>> management rather than the filesystem. If possible, we should try
>> to avoid modifying the filesystem code?
> 
> Large folios may be a MM construct, but you can't use them
> in the page cache without the backing filesystem being fully aware
> of them and the mm subsystem has to work within the constraints the
> filesystem places on large folios in the page cache.
> 
> If we need to change constraints or enact new policies around
> file IO specific large folio optimisations, then we definitely are
> going to need to modify both mm and filesystem code to implement
> them....
> 
> -Dave.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 20:47 ` Barry Song
@ 2025-03-20 14:57   ` Ryan Roberts
  2025-03-30  4:46     ` Barry Song
  0 siblings, 1 reply; 13+ messages in thread
From: Ryan Roberts @ 2025-03-20 14:57 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM, Matthew Wilcox, Dave Chinner

On 19/03/2025 20:47, Barry Song wrote:
> On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
>> I know this is very last minute, but I was hoping that it might be possible to
>> squeeze in a session to discuss the following?
>>
>> Summary/Background:
>>
>> On arm64, physically contiguous and naturally aligned regions can take advantage
>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>> regions containing text, current readahead behaviour often yields small,
>> misaligned folios, preventing this optimization. This proposal introduces a
>> special-case path for executable mappings, performing synchronous reads of an
>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>> gains.
>>
>> I’ve previously posted attempts to enable this performance improvement ([1],
>> [2]), but there were objections and conversation fizzled out. Now that I have
>> more compelling performance data, I’m hoping there is now stronger
>> justification, and we can find a path forwards.
>>
>> What I’d Like to Cover:
>>
>>  - Describe how text memory should ideally be mapped and why it benefits
>>    performance.
>>
>>  - Brief review of performance data.
>>
>>  - Discuss options for the best way to encourage text into large folios:
>>      - Let the architecture request a preferred size
>>      - Extend VMA attributes to include preferred THP size hint
> 
> We might need this for a couple of other cases.
> 
> 1. The native heap—for example, a native heap like jemalloc—can configure
> the base "granularity" and then use MADV_DONTNEED/FREE at that granularity
> to manage memory. Currently, the default granularity is PAGE_SIZE, which can
> lead to excessive folio splitting. For instance, if we set jemalloc's
> granularity to
> 16KB while sysfs supports 16KB, 32KB, 64KB, etc., splitting can still occur.
> Therefore, in some cases, I believe the kernel should be aware of how
> userspace is managing memory.
> 
> 2. Java heap GC compaction -  userfaultfd_move() things.
> I am considering adding support for batched PTE/folios moves in
> userfaultfd_move().
> If sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but the userspace Java
> heap moves
> memory at a 16KB granularity, it could lead to excessive folio splitting.

Would these heaps ever use a 64K granule or is that too big? If they can use
64K, then one simple solution would be to only enable mTHP sizes up to 64K (which
is the magic size for arm64).

Alternatively they could use MADV_NOHUGEPAGE today and be guaranteed that
memory would remain mapped as small folios.

But I see the potential problem if you want to benefit from HPA with 16K granule
there but still enable 64K globally. We have briefly discussed the idea of
supporting MADV_HUGEPAGE via process_madvise() in the past; that has an extra
param that could encode the size hint(s).

> 
> For exec, it seems we need a userspace-transparent approach. Asking each
> application to modify its code to madvise the kernel on its preferred exec folio
> size seems cumbersome.

I would much prefer a transparent approach. If we did take the approach of using
a per-VMA size hint, I was thinking that could be handled by the dynamic linker.
Then it's only one place to update.

> 
> I mean, we could whitelist all execs by default unless an application explicitly
> requests to disable it?

I guess the explicit disable would be MADV_NOHUGEPAGE. But I don't believe the
pagecache honours this right now; presumably because the memory is shared. What
would you do if one process disabled it and another didn't?

Thanks,
Ryan

> 
>>      - Provide a sysfs knob
>>      - Plug into the “mapping min folio order” infrastructure
>>      - Other approaches?
>>
>> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
> 
> Thanks
> Barry



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-20 14:57   ` Ryan Roberts
@ 2025-03-30  4:46     ` Barry Song
  2025-04-01 11:09       ` Ryan Roberts
  0 siblings, 1 reply; 13+ messages in thread
From: Barry Song @ 2025-03-30  4:46 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: lsf-pc, Linux-MM, Matthew Wilcox, Dave Chinner

On Thu, Mar 20, 2025 at 10:57 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/03/2025 20:47, Barry Song wrote:
> > On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi All,
> >>
> >> I know this is very last minute, but I was hoping that it might be possible to
> >> squeeze in a session to discuss the following?
> >>
> >> Summary/Background:
> >>
> >> On arm64, physically contiguous and naturally aligned regions can take advantage
> >> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> >> regions containing text, current readahead behaviour often yields small,
> >> misaligned folios, preventing this optimization. This proposal introduces a
> >> special-case path for executable mappings, performing synchronous reads of an
> >> architecture-chosen size into large folios (64 KB on arm64). Early performance
> >> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> >> gains.
> >>
> >> I’ve previously posted attempts to enable this performance improvement ([1],
> >> [2]), but there were objections and conversation fizzled out. Now that I have
> >> more compelling performance data, I’m hoping there is now stronger
> >> justification, and we can find a path forwards.
> >>
> >> What I’d Like to Cover:
> >>
> >>  - Describe how text memory should ideally be mapped and why it benefits
> >>    performance.
> >>
> >>  - Brief review of performance data.
> >>
> >>  - Discuss options for the best way to encourage text into large folios:
> >>      - Let the architecture request a preferred size
> >>      - Extend VMA attributes to include preferred THP size hint
> >
> > We might need this for a couple of other cases.
> >
> > 1. The native heap—for example, a native heap like jemalloc—can configure
> > the base "granularity" and then use MADV_DONTNEED/FREE at that granularity
> > to manage memory. Currently, the default granularity is PAGE_SIZE, which can
> > lead to excessive folio splitting. For instance, if we set jemalloc's
> > granularity to
> > 16KB while sysfs supports 16KB, 32KB, 64KB, etc., splitting can still occur.
> > Therefore, in some cases, I believe the kernel should be aware of how
> > userspace is managing memory.
> >
> > 2. Java heap GC compaction -  userfaultfd_move() things.
> > I am considering adding support for batched PTE/folios moves in
> > userfaultfd_move().
> > If sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but the userspace Java
> > heap moves
> > memory at a 16KB granularity, it could lead to excessive folio splitting.
>
> Would these heaps ever use a 64K granule or is that too big? If they can use
> 64K, then one simple solution would be to only enable mTHP sizes up to 64K (which
> is the magic size for arm64).
>

I'm uncertain how Lokesh plans to implement userfaultfd_move() mTHP support,
or what granularity he'll use in the Java heap GC. However, for jemalloc I've
found that 64KB is actually too large - it ends up increasing memory usage.
The issue is that we need at least 64KB of freed small objects before we can
effectively use MADV_DONTNEED. Perhaps we could try 16KB instead.
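
To make the trade-off concrete, the allocator side is roughly the below (a
sketch, not jemalloc's actual code; GRANULE stands for whatever purge
granularity the heap is configured with):

#include <stdbool.h>
#include <sys/mman.h>

#define GRANULE		(16UL * 1024)	/* heap purge granularity, e.g. 16KB */

/* Return one fully-free, GRANULE-aligned run to the kernel. With a 64KB
 * granule the allocator must wait until 64KB worth of small objects are
 * all free before any of them can be purged, which is why 64KB retains
 * noticeably more memory than 16KB for small size classes. */
static void purge_run(void *run, bool whole_run_free)
{
	if (whole_run_free)
		madvise(run, GRANULE, MADV_DONTNEED);
}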

The key requirement is that the kernel's maximum large folio size must not
exceed the memory management granularity used by userspace heap
implementations. Before implementing madvise-based per-VMA large folios for
the Java heap, I plan to first propose a large-folio aware userfaultfd_move()
and discuss that approach with Lokesh.

> Alternatively they could use MADV_NOHUGEPAGE today and be guaranteed that
> memory would remain mapped as small folios.

Right, I'm using MADV_NOHUGEPAGE specifically for small size classes in
jemalloc now, as large folios would otherwise soon be split due to unaligned
userspace heap management.

>
> But I see the potential problem if you want to benefit from HPA with 16K granule
> there but still enable 64K globally. We have briefly discussed the idea of
> supporting MADV_HUGEPAGE via process_madvise() in the past; that has an extra
> param that could encode the size hint(s).
>

I'm not sure what granularity Lokesh plans to support for moving large folios in
Java GC. But first, we need kernel support for userfaultfd_move() with mTHP.
Maybe this could serve as a use case to justify the size hint in
MADV_HUGEPAGE.
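
For reference, the move itself would just be one UFFDIO_MOVE ioctl per chunk
(sketch, assuming a kernel that already has UFFDIO_MOVE); the problem is that
each call below may land in the middle of a larger folio:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Move one GC chunk from src to dst within the same mm. If chunk_size
 * (say 16KB) is smaller than the folio backing src, the kernel currently
 * has to split that folio before it can move the pages. */
static int gc_move_chunk(int uffd, unsigned long dst, unsigned long src,
			 unsigned long chunk_size)
{
	struct uffdio_move mv = {
		.dst = dst,
		.src = src,
		.len = chunk_size,
		.mode = 0,
	};

	return ioctl(uffd, UFFDIO_MOVE, &mv);
}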

> >
> > For exec, it seems we need a userspace-transparent approach. Asking each
> > application to modify its code to madvise the kernel on its preferred exec folio
> > size seems cumbersome.
>
> I would much prefer a transparent approach. If we did take the approach of using
> a per-VMA size hint, I was thinking that could be handled by the dynamic linker.
> Then it's only one place to update.

The dynamic linker (ld.so) primarily manages the runtime linking of shared
libraries for executables. However, isn't the initial memory mapping of the
executable itself (the binary file, e.g. a.out) performed by the kernel
during exec?

>
> >
> > I mean, we could whitelist all execs by default unless an application explicitly
> > requests to disable it?
>
> I guess the explicit disable would be MADV_NOHUGEPAGE. But I don't believe the
> pagecache honours this right now; presumably because the memory is shared. What
> would you do if one process disabled and another didn't?

Correct. My previous concern was that memory-constrained devices could
experience increased memory pressure due to mandatory 64KB read operations.
A particular concern is that a 64KiB folio remains on the LRU as long as any
single subpage is active, whereas smaller folios would have been reclaimable
when inactive.

However, this appears unrelated to your patch [1]. Perhaps such systems should
disable file large folios entirely?

[1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/

>
> Thanks,
> Ryan
>
> >
> >>      - Provide a sysfs knob
> >>      - Plug into the “mapping min folio order” infrastructure
> >>      - Other approaches?
> >>
> >> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
> >>
> >> Thanks,
> >> Ryan
> >

Thanks
Barry


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-19 15:38 [LSF/MM/BPF TOPIC] Mapping text with large folios Ryan Roberts
  2025-03-19 18:16 ` Yang Shi
  2025-03-19 20:47 ` Barry Song
@ 2025-04-01 10:53 ` Ryan Roberts
  2 siblings, 0 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-04-01 10:53 UTC (permalink / raw)
  To: lsf-pc, Linux-MM; +Cc: Matthew Wilcox, Dave Chinner, Barry Song

[-- Attachment #1: Type: text/plain, Size: 1949 bytes --]

On 19/03/2025 11:38, Ryan Roberts wrote:
> Hi All,
> 
> I know this is very last minute, but I was hoping that it might be possible to
> squeeze in a session to discuss the following?
> 
> Summary/Background:
> 
> On arm64, physically contiguous and naturally aligned regions can take advantage
> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> regions containing text, current readahead behaviour often yields small,
> misaligned folios, preventing this optimization. This proposal introduces a
> special-case path for executable mappings, performing synchronous reads of an
> architecture-chosen size into large folios (64 KB on arm64). Early performance
> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> gains.
> 
> I’ve previously posted attempts to enable this performance improvement ([1],
> [2]), but there were objections and conversation fizzled out. Now that I have
> more compelling performance data, I’m hoping there is now stronger
> justification, and we can find a path forwards.
> 
> What I’d Like to Cover:
> 
>  - Describe how text memory should ideally be mapped and why it benefits
>    performance.
> 
>  - Brief review of performance data.
> 
>  - Discuss options for the best way to encourage text into large folios:
>      - Let the architecture request a preferred size
>      - Extend VMA attributes to include preferred THP size hint
>      - Provide a sysfs knob
>      - Plug into the “mapping min folio order” infrastructure
>      - Other approaches?

Slides from session attached. Includes fix to diagram on slide 3; Matthew was
correct that we don't align to the exact sync/async boundary, but extend the
async region down to the previous folio boundary.

> 
> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
> 
> Thanks,
> Ryan

[-- Attachment #2: LSFMM Mapping Text with Large Folios March 2025.pdf --]
[-- Type: application/pdf, Size: 422964 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
  2025-03-30  4:46     ` Barry Song
@ 2025-04-01 11:09       ` Ryan Roberts
  0 siblings, 0 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-04-01 11:09 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM, Matthew Wilcox, Dave Chinner

On 30/03/2025 00:46, Barry Song wrote:
> On Thu, Mar 20, 2025 at 10:57 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 19/03/2025 20:47, Barry Song wrote:
>>> On Thu, Mar 20, 2025 at 4:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I know this is very last minute, but I was hoping that it might be possible to
>>>> squeeze in a session to discuss the following?
>>>>
>>>> Summary/Background:
>>>>
>>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>>> regions containing text, current readahead behaviour often yields small,
>>>> misaligned folios, preventing this optimization. This proposal introduces a
>>>> special-case path for executable mappings, performing synchronous reads of an
>>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>>> gains.
>>>>
>>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>>> more compelling performance data, I’m hoping there is now stronger
>>>> justification, and we can find a path forwards.
>>>>
>>>> What I’d Like to Cover:
>>>>
>>>>  - Describe how text memory should ideally be mapped and why it benefits
>>>>    performance.
>>>>
>>>>  - Brief review of performance data.
>>>>
>>>>  - Discuss options for the best way to encourage text into large folios:
>>>>      - Let the architecture request a preferred size
>>>>      - Extend VMA attributes to include preferred THP size hint
>>>
>>> We might need this for a couple of other cases.
>>>
>>> 1. The native heap—for example, a native heap like jemalloc—can configure
>>> the base "granularity" and then use MADV_DONTNEED/FREE at that granularity
>>> to manage memory. Currently, the default granularity is PAGE_SIZE, which can
>>> lead to excessive folio splitting. For instance, if we set jemalloc's
>>> granularity to
>>> 16KB while sysfs supports 16KB, 32KB, 64KB, etc., splitting can still occur.
>>> Therefore, in some cases, I believe the kernel should be aware of how
>>> userspace is managing memory.
>>>
>>> 2. Java heap GC compaction -  userfaultfd_move() things.
>>> I am considering adding support for batched PTE/folios moves in
>>> userfaultfd_move().
>>> If sysfs enables 16KB, 32KB, 64KB, 128KB, etc., but the userspace Java
>>> heap moves
>>> memory at a 16KB granularity, it could lead to excessive folio splitting.
>>
>> Would these heaps ever use a 64K granule or is that too big? If they can use
>> 64K, then one simple solution would be to only enable mTHP sizes up to 64K (which
>> is the magic size for arm64).
>>
> 
> I'm uncertain how Lokesh plans to implement userfaultfd_move() mTHP support,
> or what granularity he'll use in the Java heap GC. However, for jemalloc I've
> found that 64KB is actually too large - it ends up increasing memory usage.
> The issue is that we need at least 64KB of freed small objects before we can
> effectively use MADV_DONTNEED. Perhaps we could try 16KB instead.
> 
> The key requirement is that the kernel's maximum large folio size must not
> exceed the memory management granularity used by userspace heap
> implementations. Before implementing madvise-based per-VMA large folios for
> the Java heap, I plan to first propose a large-folio aware userfaultfd_move()
> and discuss that approach with Lokesh.

We very briefly discussed this at LSF/MM; Rik van Riel suggested trying to
maintain a per-VMA heuristic capturing the granule that user space is using,
then applying that to the mTHP policy.
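
Something like the below is how I understood the suggestion (very hand-wavy
sketch; no such hook exists today and the result would still need storing and
consulting per VMA):

#include <linux/log2.h>
#include <linux/mm.h>

/* HYPOTHETICAL hook, e.g. called from madvise(MADV_DONTNEED/MADV_FREE):
 * derive the folio-order cap implied by the granule userspace just used.
 * The caller would remember the minimum seen per VMA and clamp new folio
 * allocations in that VMA to it. */
static unsigned int madv_granule_to_order_cap(unsigned long len)
{
	if (len < PAGE_SIZE)
		return 0;

	return ilog2(len >> PAGE_SHIFT);
}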

> 
>> Alternatively they could use MADV_NOHUGEPAGE today and be guaranteed that
>> memory would remain mapped as small folios.
> 
> Right, I'm using MADV_NOHUGEPAGE specifically for small size classes in
> jemalloc now, as large folios would otherwise soon be split due to unaligned
> userspace heap management.
> 
>>
>> But I see the potential problem if you want to benefit from HPA with 16K granule
>> there but still enable 64K globally. We have briefly discussed the idea of
>> supporting MADV_HUGEPAGE via process_madvise() in the past; that has an extra
>> param that could encode the size hint(s).
>>
> 
> I'm not sure what granularity Lokesh plans to support for moving large folios in
> Java GC. But first, we need kernel support for userfaultfd_move() with mTHP.
> Maybe this could serve as a use case to justify the size hint in
> MADV_HUGEPAGE.
> 
>>>
>>> For exec, it seems we need a userspace-transparent approach. Asking each
>>> application to modify its code to madvise the kernel on its preferred exec folio
>>> size seems cumbersome.
>>
>> I would much prefer a transparent approach. If we did take the approach of using
>> a per-VMA size hint, I was thinking that could be handled by the dynamic linker.
>> Then it's only one place to update.
> 
> The dynamic linker (ld.so) primarily manages the runtime linking of shared
> libraries for executables. However, isn't the initial memory mapping of the
> executable itself (the binary file, e.g. a.out) performed by the kernel
> during exec?

Yes, but for dynamically linked executables, the kernel also maps the
interpreter (the linker) and that's the first thing that executes in user
space, so it still has the opportunity to madvise() the main executable mappings.
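
As a userspace approximation of that flow (sketch only; real ld.so would walk
its own link_map, and whether MADV_HUGEPAGE currently does anything useful for
file-backed text is a separate question - the point is just where a per-VMA
hint could be issued):

#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Advise the executable PT_LOAD segments of every loaded object; the
 * first object visited is the main executable. */
static int advise_exec_segments(struct dl_phdr_info *info, size_t size,
				void *data)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	int i;

	for (i = 0; i < info->dlpi_phnum; i++) {
		const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
		uintptr_t start, end;

		if (ph->p_type != PT_LOAD || !(ph->p_flags & PF_X))
			continue;

		start = (info->dlpi_addr + ph->p_vaddr) & ~(uintptr_t)(pagesz - 1);
		end = info->dlpi_addr + ph->p_vaddr + ph->p_memsz;
		madvise((void *)start, end - start, MADV_HUGEPAGE);
	}
	return 0;	/* keep iterating over the remaining objects */
}

/* e.g. very early in startup: dl_iterate_phdr(advise_exec_segments, NULL); */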

> 
>>
>>>
>>> I mean, we could whitelist all execs by default unless an application explicitly
>>> requests to disable it?
>>
>> I guess the explicit disable would be MADV_NOHUGEPAGE. But I don't believe the
>> pagecache honours this right now; presumably because the memory is shared. What
>> would you do if one process disabled and another didn't?
> 
> Correct. My previous concern was that memory-constrained devices could
> experience increased memory pressure due to mandatory 64KB read operations.

Perhaps there is a case for continuing to honor ra->ra_pages (or aligning down
to a 64K boundary) so that a value of 0 continues to elide readahead?
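
Roughly this (sketch only, not the actual patch; arch_order is whatever the
architecture asks for, e.g. order-4 for 64K on arm64 with 4K pages):

#include <linux/log2.h>
#include <linux/minmax.h>
#include <linux/pagemap.h>

/* Sketch: pick the folio order for the exec special case, but never read
 * more than the admin's readahead budget allows, and keep ra_pages == 0
 * meaning "no readahead at all". */
static unsigned int exec_folio_order(struct readahead_control *ractl,
				     unsigned int arch_order)
{
	unsigned int ra_pages = ractl->ra->ra_pages;

	if (!ra_pages)
		return 0;

	return min(arch_order, (unsigned int)ilog2(ra_pages));
}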

> A particular concern is that a 64KiB folio remains on the LRU as long as any
> single subpage is active, whereas smaller folios would have been reclaimable
> when inactive.

We are discussing this in the context of the new post.

> 
> However, this appears unrelated to your patch [1]. Perhaps such systems should
> disable file large folios entirely?

There is currently no way to disable large folios for page cache, other than to
use file systems that don't support large folios yet :). But I agree that this
is unrelated; if there is deemed to be a problem with large folios and you need
a general switch, that's going to be the case irrespective of my change.

Thanks,
Ryan


> 
> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> 
>>
>> Thanks,
>> Ryan
>>
>>>
>>>>      - Provide a sysfs knob
>>>>      - Plug into the “mapping min folio order” infrastructure
>>>>      - Other approaches?
>>>>
>>>> [1] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
>>>> [2] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>>>>
>>>> Thanks,
>>>> Ryan
>>>
> 
> Thanks
> Barry



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-04-01 11:09 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-19 15:38 [LSF/MM/BPF TOPIC] Mapping text with large folios Ryan Roberts
2025-03-19 18:16 ` Yang Shi
2025-03-19 20:38   ` Dave Chinner
2025-03-19 22:13     ` Barry Song
2025-03-20  0:53       ` Dave Chinner
2025-03-20 14:47         ` Ryan Roberts
2025-03-20 12:16       ` Ryan Roberts
2025-03-20 12:13     ` Ryan Roberts
2025-03-19 20:47 ` Barry Song
2025-03-20 14:57   ` Ryan Roberts
2025-03-30  4:46     ` Barry Song
2025-04-01 11:09       ` Ryan Roberts
2025-04-01 10:53 ` Ryan Roberts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox