Subject: idea to ponder: handling extra pages on faults of anonymous areas
From: Mateusz Guzik @ 2025-04-15 16:58 UTC
To: x86, linux-mm

If you have an area not backed by a huge page and fault on it, you
only get the faulting 4K page zeroed, even if the mmapped area is bigger.
I have a promising result from zeroing more pages than that, but don't
have time to evaluate more workloads or code up a proper patch.
Hopefully someone(tm) will be interested enough to pick it up.
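
To make the baseline behaviour concrete, here is a minimal sketch (an
illustration, not a benchmark) which touches a fresh anonymous mapping
one page at a time and reports the minor fault count via getrusage();
on a stock kernel with no (m)THP covering the range it should report
roughly one fault per 4K page:

/* illustrative sketch: one minor fault per 4K page touched */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static long minflt(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t npages = 64, len = npages * pagesz, i;
	long before, after;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	before = minflt();
	for (i = 0; i < npages; i++)
		p[i * pagesz] = 1;	/* first write to each 4K page */
	after = minflt();

	printf("touched %zu pages, took %ld minor faults\n",
	       npages, after - before);
	munmap(p, len);
	return 0;
}
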
Rationale:
4K pages predate the fall of the Soviet Union, and RAM sizes have gone
up by orders of magnitude since then, even on what's considered a low
end system today. Similarly, memory usage of programs has gone up
significantly. It is not a stretch to suspect a bigger size would
serve real workloads better. 2MB pages are of course applicable in
some capacity, but my testing shows there are still tons of faults on
areas where they are not used.

In particular, when running everyone's favourite workload of compiling
stuff, kernel time is quite big (e.g., > 15%), and a large chunk of it
is spent handling page faults.

While the hardware does not provide good granularity (the immediate
4KB -> 2MB jump) and 4KB pages will still have to be used, the number
of faults can go down by speculatively sorting out more than just the
page which got faulted on.

I suspect rolling with 8KB would provide a good enough improvement
while suffering negligible waste in practice.

While testing 8KB would require patching the kernel, I was pointed at
knobs in /sys/kernel/mm/transparent_hugepage which facilitate early
experiments. The smallest available size is 16K, so that's what I used
below for benchmarking.

I conducted a simple experiment building will-it-scale like so:
taskset --cpu-list 1 hyperfine "gmake -s -j 1 clean all"

stock:
Time (mean ± σ): 20.707 s ± 0.080 s [User: 17.222 s, System: 3.376 s]
16K pages:
Time (mean ± σ): 19.471 s ± 0.046 s [User: 16.836 s, System: 2.608 s]

Or to put it differently, a reliable 5% reduction in real time. The
page fault count dropped to less than half, which suggests the
majority of the improvement would show up with a mere 8K instead of
16K.

The 16K case was tested with:
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled

I stress that the proposal is not necessarily to use mTHPs here (or
whatever the name is); the above was merely employed because it was
readily available. I'm told the use of these might prevent other
optimizations by the kernel -- these are artifacts of the
implementation and are not inherent to the idea.

The proposal is to fill in more than one page on faults on anonymous
areas, regardless of how it is specifically handled. I speculate that
handling two pages (aka 8KB) will be an overall win and should not
affect anything else (huge page promotions, whatever TLB fuckery and
what have you). Worst case you get a page you are not going to use.

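As a point of reference, userspace can already ask for a range to be
populated up front, e.g. with MADV_POPULATE_WRITE (Linux 5.14+;
MAP_POPULATE at mmap() time is the older variant). The sketch below
merely illustrates the effect -- the point of the proposal is that the
kernel would do a little of this on its own at fault time, without
applications having to ask for it:

/* illustrative sketch: pre-populating a range removes the per-page faults */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23	/* uapi value, for older libc headers */
#endif

static long minflt(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t npages = 64, len = npages * pagesz, i;
	long before, after;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* populate all PTEs in one go instead of one fault per page */
	if (madvise(p, len, MADV_POPULATE_WRITE))
		perror("madvise");	/* e.g. EINVAL on pre-5.14 kernels */

	before = minflt();
	for (i = 0; i < npages; i++)
		p[i * pagesz] = 1;
	after = minflt();

	printf("minor faults after populating: %ld (expect about 0)\n",
	       after - before);
	munmap(p, len);
	return 0;
}
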
I think a good quality proposal is quite time-consuming to produce and
I don't have the cycles. I also can't guarantee the mm overlords will
accept something like that. I can however point out that Google
experimented with 16KB pages for arm64 and got very promising results
(I have no idea if they switched to using them) -- I would start with
prodding those folks.

cheers
--
Mateusz Guzik <mjguzik gmail.com>

Subject: Re: idea to ponder: handling extra pages on faults of anonymous areas
From: Hao Li @ 2025-04-16 7:34 UTC
To: Mateusz Guzik; +Cc: x86, linux-mm

On Tue, Apr 15, 2025 at 06:58:43PM +0200, Mateusz Guzik wrote:
> If you have an area not backed by a huge page and fault on it, you
> only get the faulting 4K page zeroed, even if the mmapped area is bigger.
> I have a promising result from zeroing more pages than that, but don't
> have time to evaluate more workloads or code up a proper patch.
> Hopefully someone(tm) will be interested enough to pick it up.
>
> Rationale:
> 4K pages predate the fall of the Soviet Union, and RAM sizes have gone
> up by orders of magnitude since then, even on what's considered a low
> end system today. Similarly, memory usage of programs has gone up
> significantly. It is not a stretch to suspect a bigger size would
> serve real workloads better. 2MB pages are of course applicable in
> some capacity, but my testing shows there are still tons of faults on
> areas where they are not used.
>
> In particular, when running everyone's favourite workload of compiling
> stuff, kernel time is quite big (e.g., > 15%), and a large chunk of it
> is spent handling page faults.
>
> While the hardware does not provide good granularity (the immediate
> 4KB -> 2MB jump) and 4KB pages will still have to be used, the number
> of faults can go down by speculatively sorting out more than just the
> page which got faulted on.
>
> I suspect rolling with 8KB would provide a good enough improvement
> while suffering negligible waste in practice.
>
> While testing 8KB would require patching the kernel, I was pointed at
> knobs in /sys/kernel/mm/transparent_hugepage which facilitate early
> experiments. The smallest available size is 16K, so that's what I used
> below for benchmarking.
>
> I conducted a simple experiment building will-it-scale like so:
> taskset --cpu-list 1 hyperfine "gmake -s -j 1 clean all"
>
> stock:
> Time (mean ± σ): 20.707 s ± 0.080 s [User: 17.222 s, System: 3.376 s]
> 16K pages:
> Time (mean ± σ): 19.471 s ± 0.046 s [User: 16.836 s, System: 2.608 s]
>
> Or to put it differently, a reliable 5% reduction in real time. The
> page fault count dropped to less than half, which suggests the
> majority of the improvement would show up with a mere 8K instead of
> 16K.
>
> The 16K case was tested with:
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>
> I stress that the proposal is not necessarily to use mTHPs here (or
> whatever the name is); the above was merely employed because it was
> readily available. I'm told the use of these might prevent other
> optimizations by the kernel -- these are artifacts of the
> implementation and are not inherent to the idea.
>
> The proposal is to fill in more than one page on faults on anonymous
> areas, regardless of how it is specifically handled. I speculate that
> handling two pages (aka 8KB) will be an overall win and should not
> affect anything else (huge page promotions, whatever TLB fuckery and
> what have you). Worst case you get a page you are not going to use.

Hi,
This sounds like the anonymous memory version of file readahead. For
example, when userspace mmaps a range of anonymous memory, the kernel
could speculate that the application will trigger do_anonymous_page
faults sequentially. So instead of handling one page fault per PTE, it
might be better to preemptively populate several PTEs in one go.
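
To illustrate the analogy: file-backed mappings already get a form of
this via fault-around, which can map several already-cached pages per
fault, while anonymous mappings take roughly one fault per page. A
small comparison sketch (exact counts depend on the kernel and on
/sys/kernel/debug/fault_around_bytes):

/* illustrative sketch: fault-around for file mappings vs anonymous memory */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

#define NPAGES	64

static long minflt(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t len = NPAGES * pagesz;
	char path[] = "/tmp/fault-around-demo-XXXXXX";
	char *buf, *filemap, *anon;
	long faults;
	int fd, i;

	/* put NPAGES worth of file data into the page cache */
	fd = mkstemp(path);
	buf = calloc(1, len);
	if (fd < 0 || !buf || write(fd, buf, len) != (ssize_t)len)
		return 1;
	free(buf);
	unlink(path);

	filemap = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (filemap == MAP_FAILED || anon == MAP_FAILED)
		return 1;

	faults = minflt();
	for (i = 0; i < NPAGES; i++)
		(void)*(volatile char *)(filemap + i * pagesz);
	printf("file-backed reads: %ld faults for %d pages\n",
	       minflt() - faults, NPAGES);

	faults = minflt();
	for (i = 0; i < NPAGES; i++)
		anon[i * pagesz] = 1;
	printf("anonymous writes:  %ld faults for %d pages\n",
	       minflt() - faults, NPAGES);
	return 0;
}
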
>
> I think a good quality proposal is quite time-consuming to produce and
> I don't have the cycles. I also can't guarantee the mm overlords will
> accept something like that. I can however point out that Google
> experimented with 16KB pages for arm64 and got very promising results
> (I have no idea if they switched to using them) -- I would start with
> prodding those folks.
>
> cheers
> --
> Mateusz Guzik <mjguzik gmail.com>
>
>