linux-mm.kvack.org archive mirror
From: Ryan Roberts <ryan.roberts@arm.com>
To: Barry Song <21cnbao@gmail.com>, Dave Chinner <david@fromorbit.com>
Cc: Yang Shi <shy828301@gmail.com>,
	lsf-pc@lists.linux-foundation.org, Linux-MM <linux-mm@kvack.org>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [LSF/MM/BPF TOPIC] Mapping text with large folios
Date: Thu, 20 Mar 2025 12:16:04 +0000
Message-ID: <fa8b5351-bf77-4eef-90a8-e9a5a8306be2@arm.com>
In-Reply-To: <CAGsJ_4zFavrNUiO4YPmFMqfTfkJsGM20RGObzUT20oATAVZdQw@mail.gmail.com>

Apologies, I just sent a response to Dave that raises most of the same points
that Barry raises here. I'll read the full thread before replying further :)


On 19/03/2025 22:13, Barry Song wrote:
> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I know this is very last minute, but I was hoping that it might be possible to
>>>> squeeze in a session to discuss the following?
>>
>> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
>> discussed on the dev lists...
>>
>>>> Summary/Background:
>>>>
>>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>>> regions containing text, current readahead behaviour often yields small,
>>>> misaligned folios, preventing this optimization. This proposal introduces a
>>>> special-case path for executable mappings, performing synchronous reads of an
>>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>>> gains.
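
A minimal sketch of the alignment constraint described in the summary above, assuming a 4 KB base page and the 64 KB contpte size quoted there. It models only file-offset coverage: given the folios that back a file, it lists the naturally aligned 64 KB blocks that a single folio covers completely. The function name and the (offset, size) representation are illustrative, not kernel interfaces, and physical and virtual alignment are ignored.

```python
CONT_SIZE = 64 * 1024  # contpte block size quoted in the thread (arm64)


def eligible_blocks(folios):
    """Return start offsets of naturally aligned 64 KB blocks fully
    covered by a single folio.

    folios: iterable of (file_offset, size_in_bytes) pairs (hypothetical
    representation of the page cache contents for one file).
    """
    blocks = []
    for offset, size in folios:
        # First 64 KB-aligned offset at or after the start of the folio.
        block = -(-offset // CONT_SIZE) * CONT_SIZE
        # Collect every aligned block that fits entirely inside the folio.
        while block + CONT_SIZE <= offset + size:
            blocks.append(block)
            block += CONT_SIZE
    return blocks
```

With one 64 KB folio at offset 0 the first block qualifies; with the same range split into 16 KB, 16 KB and 32 KB folios, no block does, even though the same bytes are cached.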
>>>
>>> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
>>> adding to the tests.
>>>
>>>>
>>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>>> more compelling performance data, I’m hoping there is now stronger
>>>> justification, and we can find a path forwards.
>>>>
>>>> What I’d Like to Cover:
>>>>
>>>>  - Describe how text memory should ideally be mapped and why it benefits
>>>>    performance.
>>
>> I think the main people involved already understand this...
>>
>>>>  - Brief review of performance data.
>>
>> You don't need to convince me - there's 3 decades of evidence
>> proving that larger, fewer page table mappings for executables
>> results in better performance.
>>
>>>>  - Discuss options for the best way to encourage text into large folios:
>>>>      - Let the architecture request a preferred size
>>>>      - Extend VMA attributes to include preferred THP size hint
>>>>      - Provide a sysfs knob
>>>>      - Plug into the “mapping min folio order” infrastructure
>>>>      - Other approaches?
>>
>> Implement generic large folio/sequential PTE mapping optimisations
>> for each platform, then control it by letting the filesystem decide
>> what the desired mapping order and alignment should be for any given
>> inode mapping tree.
>>
>>> Did you try LBS? You can have 64K block size with LBS, it should
>>> create large folios for page cache so text should get large folios
>>> automatically (IIRC arm64 linker script has 64K alignment by default).
>>
>> We really don't want people using 64kB block size filesystems for
>> root filesystems - there are plenty of downsides to using huge block
>> sizes for filesystems that generally hold many tiny files.
> 
> Agreed. Large folios will be compatible with existing file systems and
> applications, which don’t always require userspace to adopt them.
> 
>>
>> However, I agree with the general principle that the fs should be
>> directing the inode mapping tree folio order behaviour.  i.e. the
>> filesystem already sets both the floor and the desired behaviour for
>> folio instantiation for any given inode mapping tree.
>>
>> It also needs to be able to instantiate large folios -before- the
>> executable is mapped into VMAs via mmap() because files can be read
>> into cache before they are run (e.g. boot time readahead hacks).
>> i.e. a mmap() time directive is too late to apply to the inode
>> mapping tree to guarantee optimal layout for PTE optimisation. It
>> also may not be possible to apply mmap() time directives due to
>> other filesystem constraints, so mmap() time directives may well end
>> up being unpredictable and unreliable....
>>
> 
> ELF loading and the linker may cause a small portion of the code text
> to be read ahead before mmap(). However, for large executables, the
> few large folios lost to that limited readahead are probably not
> significant enough to worry about.
> 
> But "boot time readahead hacks" seem like something that can read
> ahead significantly. Unless we can modify these "boot time readahead
> hacks" to use mmap() with EXEC mapping, it seems we would need
> something at the sys_read() to apply the preferred size.
> 
>> There's also an obvious filesystem level trigger for enabling this
>> behaviour in a generic manner.  e.g. The filesystem can look at the
>> X perm bits on the inode at instantiation time and if they are set,
>> set a "desired order" value+flag on the mapping at inode cache
>> instantiation in addition to "min order".
>>
> 
> Not sure what proportion of an executable file is the text section. If it's
> less than 30% or 50%, it seems we might be allocating "preferred size"
> large folios to many other sections that may not benefit from them?
> 
> Also, a Bash shell script with executable permissions might get a
> preferred large folio size. This seems weird?
> 
> By the way, are .so files executable files, even though they may contain
> a lot of code? As I check my filesystems, it seems not:
> 
> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
> -rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13
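
A small userspace sketch of the check behind the question above, assuming the heuristic would look at any of the owner, group or other execute bits. The helper name is illustrative; it shows only that a file with mode 0644, like the libz example, would not be selected by an X-bit test.

```python
import os
import stat


def has_exec_bit(path):
    """Return True if any execute permission bit is set on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
```

Calling `has_exec_bit()` on a shared library installed with mode 0644 returns False, while a shell script with mode 0755 returns True, which is the mismatch raised here.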
> 
> 
>> If a desired order is configured, the page cache read code can then
>> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
>> value to folio allocation. If that can't be allocated then it can
>> fall back to single page folios instead of failing.
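
A minimal model of the try-then-fall-back behaviour described above, not kernel code. `can_alloc` is a hypothetical callback standing in for the page allocator, and the function name is illustrative; the only point shown is that a failed attempt at the desired order degrades to an order-0 (single page) folio rather than failing the read.

```python
def alloc_folio_order(desired_order, can_alloc):
    """Return the folio order actually obtained, or None if even a
    single page cannot be allocated.

    desired_order: preferred order configured on the mapping (assumed).
    can_alloc: callable(order) -> bool, a stand-in for the allocator.
    """
    # Optimistically try the preferred large order first.
    if desired_order > 0 and can_alloc(desired_order):
        return desired_order
    # Fall back to a single-page folio instead of failing.
    if can_alloc(0):
        return 0
    return None
```

With 4 KB pages, a desired order of 4 corresponds to a 64 KB folio; under fragmentation the same call simply returns 0.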
>>
>> At this point, we will always optimistically try to allocate larger
>> folios for executables on all architectures. Architectures that
>> can optimise sequential PTE mappings can then simply add generic
>> support for large folio optimisation, and more efficient executable
>> mappings simply fall out of the generic support for efficient
>> mapping of large folios and filesystems preferring large folios for
>> executable inode mappings....
> 
> I feel this falls more within the scope of architecture and memory
> management rather than the filesystem. If possible, we should try
> to avoid modifying the filesystem code?
> 
>>
>> -Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
> 
> Thanks
> Barry



