* Re: Project: Improving the PCP allocator
2024-01-22 16:01 Project: Improving the PCP allocator Matthew Wilcox
@ 2024-01-24 19:18 ` Christoph Lameter (Ampere)
2024-01-24 21:03 ` Matthew Wilcox
2025-03-12 8:53 ` Sharing the (failed) experimental results of the idea "Keeping compound pages as compound pages in the PCP" Harry Yoo
1 sibling, 1 reply; 4+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-01-24 19:18 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm
On Mon, 22 Jan 2024, Matthew Wilcox wrote:
> When we have memdescs, allocating a folio from the buddy is a two step
> process. First we allocate the struct folio from slab, then we ask the
> buddy allocator for 2^n pages, each of which gets its memdesc set to
> point to this folio. It'll be similar for other memory descriptors,
> but let's keep it simple and just talk about folios for now.
I need to catch up on memdescs. One of the key issues may be fragmentation
that occurs during alloc / free of folios of different sizes.
Maybe we could use an approach similar to what the slab allocator uses to
defragment: allocate larger folios/pages, then break out sub-folios/
sizes/components from them until the large page is used up, and recycle
any frees of components in that page before moving on to the next large
page.
With that we end up with a list of per-cpu huge pages that the page
allocator serves from, similar to the cpu partial lists in SLUB. Once the
current huge page is used up, the page allocator needs to move on to a
huge page that already has a lot of recent frees of smaller fragments. So
something like partial lists could also exist in the page allocator,
basically sorted by the available space within each huge page.
There is the additional issue of having to break out different sizes, so
it may not be as easy as in the SLUB allocator, because objects of
different sizes coexist within one huge page.
Basically this is a move from SLAB-style object management (caching long
lists of small objects without regard to locality, which increases
fragmentation) to a combination of spatial considerations and lists of
large frames. I think this is necessary in order to keep memory as
defragmented as possible.
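To make this concrete, a very rough sketch of the bookkeeping such a
scheme might need (all names are hypothetical, purely for illustration):

/*
 * Hypothetical per-cpu huge-frame management for the page allocator,
 * loosely modelled on SLUB's cpu/partial slab lists. All names are
 * made up; this only illustrates the idea, it is not a proposal.
 */
struct pcp_frame {
	struct list_head list;		/* on the partial list */
	struct page *base;		/* the underlying huge page */
	unsigned long free_map;		/* which sub-ranges are free */
	unsigned int free_space;	/* pages still available */
};

struct pcp_frame_lists {
	/* Frame that allocations are currently served from. */
	struct pcp_frame *current_frame;
	/*
	 * Frames with recent frees of smaller fragments, kept roughly
	 * sorted by free_space within each huge page.
	 */
	struct list_head partial;
};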
> I think this could be a huge saving. Consider allocating an order-9 PMD
> sized THP. Today we initialise compound_head in each of the 511 tail
> pages. Since a page is 64 bytes, we touch 32kB of memory! That's 2/3 of
> my CPU's L1 D$, so it's just pushed out a good chunk of my working set.
> And it's all dirty, so it has to get written back.
Right.
> We still need to distinguish between specifically folios (which
> need the folio_prep_large_rmappable() call on allocation and
> folio_undo_large_rmappable() on free) and other compound allocations which
> do not need or want this, but that's touching one/two extra cachelines,
> not 511.
> Do we have a volunteer?
Maybe. I have to think about this, but since I got my hands dirty on the
PCP logic years ago, I may qualify. I need to get my head around the
details and see where this could go.
* Sharing the (failed) experimental results of the idea "Keeping compound pages as compound pages in the PCP"
2024-01-22 16:01 Project: Improving the PCP allocator Matthew Wilcox
2024-01-24 19:18 ` Christoph Lameter (Ampere)
@ 2025-03-12 8:53 ` Harry Yoo
1 sibling, 0 replies; 4+ messages in thread
From: Harry Yoo @ 2025-03-12 8:53 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm
was: (Project: Improving the PCP allocator)
On Mon, Jan 22, 2024 at 04:01:34PM +0000, Matthew Wilcox wrote:
> As I mentioned here [1], I have Thoughts on how the PCP allocator
> works in a memdesc world. Unlike my earlier Thoughts on the buddy
> allocator [2], we can actually make progress towards this one (and
> see substantial performance improvement, I believe). So it's ripe for
> someone to pick up.
>
> == With memdescs ==
>
> When we have memdescs, allocating a folio from the buddy is a two step
> process. First we allocate the struct folio from slab, then we ask the
> buddy allocator for 2^n pages, each of which gets its memdesc set to
> point to this folio. It'll be similar for other memory descriptors,
> but let's keep it simple and just talk about folios for now.
>
> Usually when we free folios, it's due to memory pressure (yes, we'll free
> memory due to truncating a file or processes exiting and freeing their
> anonymous memory, but that's secondary). That means we're likely to want
> to allocate a folio again soon. Given that, returning the struct folio
> to the slab allocator seems like a waste of time. The PCP allocator
> can hold onto the struct folio as well as the underlying memory and
> then just hand it back to the next caller of folio_alloc. This also
> saves us from having to invent a 'struct pcpdesc' and swap the memdesc
> pointer from the folio to the pcpdesc.
>
> This implies that we no longer have a single pcp allocator for all types
> of memory; rather we have one for each memdesc type. I think that's
> going to be OK, but it might introduce some problems.
>
> == Before memdescs ==
>
> Today we take all comers on the PCP list. __free_pages() calls
> free_the_page() calls free_unref_page() calls free_unref_page_prepare()
> calls free_pages_prepare() which undoes all the PageCompound work.
>
> Most multi-page allocations are compound. Slab, file, anon; it's all
> compound. I propose that we _only_ keep compound memory on the PCP list.
> Freeing non-compound multi-page memory can either convert it into compound
> pages before being placed on the PCP list or just hand the memory back
> to the buddy allocator. Non-compound multi-page allocations can either
> go straight to buddy or grab from the PCP list and undo the compound
> nature of the pages.
>
> I think this could be a huge saving. Consider allocating an order-9 PMD
> sized THP. Today we initialise compound_head in each of the 511 tail
> pages. Since a page is 64 bytes, we touch 32kB of memory! That's 2/3 of
> my CPU's L1 D$, so it's just pushed out a good chunk of my working set.
> And it's all dirty, so it has to get written back.
I've been exploring the idea of keeping compound pages as compound pages
while they reside in PCP lists.
My experimental results so far suggest that reducing the overhead of
destroying and preparing compound pages does not have a substantial
impact on pft and kernbench.
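For reference, the overhead in question is essentially the loop over the
tail pages in prep_compound_page() (plus the matching teardown when the
compound page is destroyed). A simplified sketch of that logic, not
verbatim kernel code:

/*
 * Simplified sketch of what prep_compound_page() does for an order-n
 * page; not verbatim kernel code.
 */
static void prep_compound_page_sketch(struct page *head, unsigned int order)
{
	unsigned long nr_pages = 1UL << order;
	unsigned long i;

	__SetPageHead(head);
	for (i = 1; i < nr_pages; i++) {
		struct page *p = head + i;

		/*
		 * Every tail page gets compound_head pointing at the
		 * head page. For order-9 that is 511 struct pages,
		 * i.e. ~32kB of memory touched and dirtied.
		 */
		set_compound_head(p, head);
	}
}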
For the record, let me share what I did and the results I obtained.
# Implementation
My toy implementation is here:
https://gitlab.com/hyeyoo/linux/-/tree/pcp-keep-compound
In this implementation, I modified free_pages_prepare() to skip
destroying compound pages when freeing to the PCP. If the kernel fails to
acquire the PCP lock, it destroys the compound page and frees it to the
buddy instead. Because order>0 pages in the PCP are compound pages,
rmqueue() does not prepare compound pages when they are already compound.
rmqueue_bulk() prepares compound pages when grabbing pages from the
buddy, and free_pcppages_bulk() destroys compound pages before giving
them back to the buddy.
Non-compound multi-page allocations simply bypass the PCP.
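In pseudocode, the free-side decision looks roughly like this (heavily
simplified; the helper names are illustrative, not the actual patch):

/*
 * Heavily simplified sketch of the experimental free path.
 * Helper names are illustrative only.
 */
static void free_unref_compound_page(struct page *page, unsigned int order)
{
	struct per_cpu_pages *pcp;

	/* Try to keep the page compound while it sits on the PCP. */
	pcp = try_lock_this_cpu_pcp();			/* illustrative */
	if (pcp) {
		pcp_add_compound(pcp, page, order);	/* still compound */
		unlock_this_cpu_pcp(pcp);
		return;
	}

	/*
	 * Could not take the PCP lock: fall back to the buddy, which
	 * requires tearing down the compound state first.
	 */
	destroy_compound_state(page, order);		/* illustrative */
	free_to_buddy(page, order);			/* illustrative */
}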
# Evaluation
My experiments were done on a Ryzen 5900X (12 cores, 24 threads) box with
128GiB of DDR4-3200 memory. In the summaries below, '2MB-32kB-16kB' means
the kernel first tries to allocate 2MB and then falls back to 32kB/16kB.
In each table, the first column (with 0.00% deltas) is the baseline and
the second is the patched kernel, with the relative change in parentheses.
[PFT, 2MB-32kB-16kB]
# timings
Amean system-1 1.51 ( 0.00%) 1.50 ( 0.20%)
Amean system-3 4.20 ( 0.00%) 4.19 ( 0.39%)
Amean system-5 7.23 ( 0.00%) 7.20 ( 0.48%)
Amean system-7 8.91 ( 0.00%) 8.97 ( -0.67%)
Amean system-12 17.80 ( 0.00%) 17.75 ( 0.25%)
Amean system-18 26.26 ( 0.00%) 26.14 ( 0.47%)
Amean system-24 34.88 ( 0.00%) 34.97 ( -0.28%)
Amean elapsed-1 1.55 ( 0.00%) 1.55 ( -0.00%)
Amean elapsed-3 1.42 ( 0.00%) 1.42 ( 0.23%)
Amean elapsed-5 1.48 ( 0.00%) 1.48 ( 0.11%)
Amean elapsed-7 1.47 ( 0.00%) 1.47 ( -0.11%)
Amean elapsed-12 1.50 ( 0.00%) 1.49 ( 0.18%)
Amean elapsed-18 1.53 ( 0.00%) 1.53 ( 0.15%)
Amean elapsed-24 1.56 ( 0.00%) 1.56 ( 0.04%)
# faults
Hmean faults/cpu-1 4268815.8976 ( 0.00%) 4270575.7783 ( 0.04%)
Hmean faults/cpu-3 1547723.5433 ( 0.00%) 1551782.9058 ( 0.26%)
Hmean faults/cpu-5 895050.1683 ( 0.00%) 894949.2879 ( -0.01%)
Hmean faults/cpu-7 726379.5638 ( 0.00%) 719439.1475 ( -0.96%)
Hmean faults/cpu-12 368548.1793 ( 0.00%) 369313.2284 ( 0.21%)
Hmean faults/cpu-18 246883.6901 ( 0.00%) 248085.6269 ( 0.49%)
Hmean faults/cpu-24 182289.8542 ( 0.00%) 182349.3383 ( 0.03%)
Hmean faults/sec-1 4263408.8848 ( 0.00%) 4266020.8111 ( 0.06%)
Hmean faults/sec-3 4631463.5864 ( 0.00%) 4642967.3237 ( 0.25%)
Hmean faults/sec-5 4440309.3622 ( 0.00%) 4445610.5062 ( 0.12%)
Hmean faults/sec-7 4482600.6360 ( 0.00%) 4477462.5227 ( -0.11%)
Hmean faults/sec-12 4402147.6818 ( 0.00%) 4410717.5042 ( 0.19%)
Hmean faults/sec-18 4301172.0933 ( 0.00%) 4308668.3804 ( 0.17%)
Hmean faults/sec-24 4234656.1409 ( 0.00%) 4237997.7612 ( 0.08%)
[PFT, 32kB-16kB]
# timings
Amean system-1 2.34 ( 0.00%) 2.38 ( -1.93%)
Amean system-3 4.14 ( 0.00%) 4.15 ( -0.10%)
Amean system-5 7.11 ( 0.00%) 7.10 ( 0.16%)
Amean system-7 9.32 ( 0.00%) 9.28 ( 0.42%)
Amean system-12 17.92 ( 0.00%) 17.96 ( -0.22%)
Amean system-18 25.86 ( 0.00%) 25.62 ( 0.93%)
Amean system-24 35.67 ( 0.00%) 35.66 ( 0.03%)
Amean elapsed-1 2.47 ( 0.00%) 2.51 * -1.95%*
Amean elapsed-3 1.42 ( 0.00%) 1.42 ( -0.07%)
Amean elapsed-5 1.46 ( 0.00%) 1.45 ( 0.32%)
Amean elapsed-7 1.47 ( 0.00%) 1.46 ( 0.30%)
Amean elapsed-12 1.51 ( 0.00%) 1.51 ( -0.31%)
Amean elapsed-18 1.53 ( 0.00%) 1.52 ( 0.22%)
Amean elapsed-24 1.52 ( 0.00%) 1.52 ( 0.09%)
# faults
Hmean faults/cpu-1 2674791.9210 ( 0.00%) 2622289.9694 * -1.96%*
Hmean faults/cpu-3 1548744.0561 ( 0.00%) 1547049.2525 ( -0.11%)
Hmean faults/cpu-5 909932.0813 ( 0.00%) 911274.4409 ( 0.15%)
Hmean faults/cpu-7 697042.1240 ( 0.00%) 700115.5680 ( 0.44%)
Hmean faults/cpu-12 365314.7268 ( 0.00%) 364452.9596 ( -0.24%)
Hmean faults/cpu-18 251711.1949 ( 0.00%) 253133.5633 ( 0.57%)
Hmean faults/cpu-24 183212.0204 ( 0.00%) 183350.4502 ( 0.08%)
Hmean faults/sec-1 2673435.8060 ( 0.00%) 2621259.7412 * -1.95%*
Hmean faults/sec-3 4637478.9272 ( 0.00%) 4634486.4260 ( -0.06%)
Hmean faults/sec-5 4531908.1729 ( 0.00%) 4541788.9414 ( 0.22%)
Hmean faults/sec-7 4482779.5699 ( 0.00%) 4499528.9948 ( 0.37%)
Hmean faults/sec-12 4365201.3350 ( 0.00%) 4353874.4404 ( -0.26%)
Hmean faults/sec-18 4314991.0014 ( 0.00%) 4321977.0563 ( 0.16%)
Hmean faults/sec-24 4344249.9657 ( 0.00%) 4345263.1455 ( 0.02%)
[PFT, 16kB]
# timings
Amean system-1 3.13 ( 0.00%) 3.22 * -2.84%*
Amean system-3 4.31 ( 0.00%) 4.38 ( -1.70%)
Amean system-5 6.93 ( 0.00%) 6.95 ( -0.21%)
Amean system-7 9.40 ( 0.00%) 9.47 ( -0.66%)
Amean system-12 17.72 ( 0.00%) 17.75 ( -0.15%)
Amean system-18 26.19 ( 0.00%) 26.12 ( 0.24%)
Amean system-24 35.41 ( 0.00%) 35.31 ( 0.29%)
Amean elapsed-1 3.34 ( 0.00%) 3.43 * -2.79%*
Amean elapsed-3 1.50 ( 0.00%) 1.52 * -1.67%*
Amean elapsed-5 1.44 ( 0.00%) 1.44 ( -0.09%)
Amean elapsed-7 1.46 ( 0.00%) 1.46 ( -0.43%)
Amean elapsed-12 1.50 ( 0.00%) 1.50 ( -0.04%)
Amean elapsed-18 1.50 ( 0.00%) 1.49 ( 0.18%)
Amean elapsed-24 1.51 ( 0.00%) 1.50 ( 0.58%)
# faults
Hmean faults/cpu-1 1975802.9271 ( 0.00%) 1921547.2378 * -2.75%*
Hmean faults/cpu-3 1468850.1816 ( 0.00%) 1444987.8129 * -1.62%*
Hmean faults/cpu-5 921763.7641 ( 0.00%) 920183.2144 ( -0.17%)
Hmean faults/cpu-7 686678.6801 ( 0.00%) 681824.6507 ( -0.71%)
Hmean faults/cpu-12 367972.3092 ( 0.00%) 367485.9867 ( -0.13%)
Hmean faults/cpu-18 248991.2074 ( 0.00%) 249470.5429 ( 0.19%)
Hmean faults/cpu-24 184426.2136 ( 0.00%) 184968.3682 ( 0.29%)
Hmean faults/sec-1 1975110.1317 ( 0.00%) 1920789.4502 * -2.75%*
Hmean faults/sec-3 4400629.2802 ( 0.00%) 4327578.7473 * -1.66%*
Hmean faults/sec-5 4594228.5709 ( 0.00%) 4589647.3446 ( -0.10%)
Hmean faults/sec-7 4524451.1451 ( 0.00%) 4502414.9629 ( -0.49%)
Hmean faults/sec-12 4391814.1854 ( 0.00%) 4392709.6380 ( 0.02%)
Hmean faults/sec-18 4406851.3807 ( 0.00%) 4409517.3005 ( 0.06%)
Hmean faults/sec-24 4376471.9465 ( 0.00%) 4395938.4025 ( 0.44%)
[kernbench, 2MB-32kB-16kB THP]
Amean user-24 16236.12 ( 0.00%) 16402.41 ( -1.02%)
Amean syst-24 1289.28 ( 0.00%) 1329.39 * -3.11%*
Amean elsp-24 763.75 ( 0.00%) 775.03 ( -1.48%)
[kernbench, 32kB-16kB THP]
Amean user-24 16334.94 ( 0.00%) 16314.30 ( 0.13%)
Amean syst-24 1565.50 ( 0.00%) 1609.74 * -2.83%*
Amean elsp-24 779.84 ( 0.00%) 780.55 ( -0.09%)
[kernbench, 16kB]
Amean user-24 16205.47 ( 0.00%) 16282.73 * -0.48%*
Amean syst-24 1784.03 ( 0.00%) 1837.13 * -2.98%*
Amean elsp-24 787.37 ( 0.00%) 791.00 * -0.46%*
I suspected that making the zone->lock section in free_pcppages_bulk() a
little finer-grained in my implementation had a negative impact, so I
changed it back to being coarse-grained and ran the experiments again,
but it was still not a clear win on kernbench and pft.
So... yeah. The results for this idea are not that encouraging.
Based on the results, my theories are:
1) The overhead of preparing compound pages is too small to show a
   measurable difference (prep_compound_page() accounts for less than
   0.5% of the total profile on pft).
2) By the time the kernel prepares compound pages, a large chunk of the
   CPU cache has already been dirtied (e.g., when allocating a 2MB anon
   THP, the kernel touches the whole 2MB of memory anyway), so dirtying
   ~1.56% more cache lines (512 struct pages * 64 bytes = 32kB, against
   the 2MB of data) does not have a huge impact on performance.
--
Cheers,
Harry