* Re: Project: Improving the PCP allocator
2024-01-22 16:01 Project: Improving the PCP allocator Matthew Wilcox
@ 2024-01-24 19:18 ` Christoph Lameter (Ampere)
2024-01-24 21:03 ` Matthew Wilcox
2025-03-12 8:53 ` Sharing the (failed) experimental results of the idea "Keeping compound pages as compound pages in the PCP" Harry Yoo
1 sibling, 1 reply; 4+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-01-24 19:18 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm
On Mon, 22 Jan 2024, Matthew Wilcox wrote:
> When we have memdescs, allocating a folio from the buddy is a two step
> process. First we allocate the struct folio from slab, then we ask the
> buddy allocator for 2^n pages, each of which gets its memdesc set to
> point to this folio. It'll be similar for other memory descriptors,
> but let's keep it simple and just talk about folios for now.
I need to catch up on memdescs. One of the key issues may be fragmentation
that occurs during alloc / free of folios of different sizes.
Maybe we could use an approach similar to what the slab allocator uses to
defragment: allocate larger folios/pages, then break out sub-folios/
sizes/components from them until the large page is used up, and recycle
any frees of components in that page before moving on to the next large
page.
With that we end up with a list of per-cpu huge pages that the page
allocator serves from, similar to the cpu partial lists in SLUB. Once the
current huge page is used up, the page allocator needs to move on to a
huge page that already has a lot of recent frees of smaller fragments. So
something like partial lists could also exist in the page allocator,
basically sorted by the available space within each huge page.
There is the additional issue of having to break out different sizes, so
it may not be as easy as in the SLUB allocator, because objects of
different sizes coexist within one huge page.
Basically this is a move from SLAB-style object management (caching long
lists of small objects without regard to locality, which increases
fragmentation) to a combination of spatial considerations and lists of
large frames. I think this is necessary in order to keep memory as
defragmented as possible.
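To make this concrete, a very rough sketch of the bookkeeping such a
scheme might need (all names are hypothetical, purely for illustration):

/*
 * Hypothetical per-cpu huge-frame management for the page allocator,
 * loosely modelled on SLUB's cpu/partial slab lists. All names are
 * made up; this only illustrates the idea, it is not a proposal.
 */
struct pcp_frame {
	struct list_head list;		/* on the partial list */
	struct page *base;		/* the underlying huge page */
	unsigned long free_map;		/* which sub-ranges are free */
	unsigned int free_space;	/* pages still available */
};

struct pcp_frame_lists {
	/* Frame that allocations are currently served from. */
	struct pcp_frame *current_frame;
	/*
	 * Frames with recent frees of smaller fragments, kept roughly
	 * sorted by free_space within each huge page.
	 */
	struct list_head partial;
};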
> I think this could be a huge saving. Consider allocating an order-9 PMD
> sized THP. Today we initialise compound_head in each of the 511 tail
> pages. Since a page is 64 bytes, we touch 32kB of memory! That's 2/3 of
> my CPU's L1 D$, so it's just pushed out a good chunk of my working set.
> And it's all dirty, so it has to get written back.
Right.
> We still need to distinguish between specifically folios (which
> need the folio_prep_large_rmappable() call on allocation and
> folio_undo_large_rmappable() on free) and other compound allocations which
> do not need or want this, but that's touching one/two extra cachelines,
> not 511.
> Do we have a volunteer?
Maybe. I have to think about this, but since I got my hands dirty on the
PCP logic years ago, I may qualify. I need to get my head around the
details and see where this could go.
* Sharing the (failed) experimental results of the idea "Keeping compound pages as compound pages in the PCP"
2024-01-22 16:01 Project: Improving the PCP allocator Matthew Wilcox
2024-01-24 19:18 ` Christoph Lameter (Ampere)
@ 2025-03-12 8:53 ` Harry Yoo
1 sibling, 0 replies; 4+ messages in thread
From: Harry Yoo @ 2025-03-12 8:53 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm
was: (Project: Improving the PCP allocator)
On Mon, Jan 22, 2024 at 04:01:34PM +0000, Matthew Wilcox wrote:
> As I mentioned here [1], I have Thoughts on how the PCP allocator
> works in a memdesc world. Unlike my earlier Thoughts on the buddy
> allocator [2], we can actually make progress towards this one (and
> see substantial performance improvement, I believe). So it's ripe for
> someone to pick up.
>
> == With memdescs ==
>
> When we have memdescs, allocating a folio from the buddy is a two step
> process. First we allocate the struct folio from slab, then we ask the
> buddy allocator for 2^n pages, each of which gets its memdesc set to
> point to this folio. It'll be similar for other memory descriptors,
> but let's keep it simple and just talk about folios for now.
>
> Usually when we free folios, it's due to memory pressure (yes, we'll free
> memory due to truncating a file or processes exiting and freeing their
> anonymous memory, but that's secondary). That means we're likely to want
> to allocate a folio again soon. Given that, returning the struct folio
> to the slab allocator seems like a waste of time. The PCP allocator
> can hold onto the struct folio as well as the underlying memory and
> then just hand it back to the next caller of folio_alloc. This also
> saves us from having to invent a 'struct pcpdesc' and swap the memdesc
> pointer from the folio to the pcpdesc.
>
> This implies that we no longer have a single pcp allocator for all types
> of memory; rather we have one for each memdesc type. I think that's
> going to be OK, but it might introduce some problems.
>
> == Before memdescs ==
>
> Today we take all comers on the PCP list. __free_pages() calls
> free_the_page() calls free_unref_page() calls free_unref_page_prepare()
> calls free_pages_prepare() which undoes all the PageCompound work.
>
> Most multi-page allocations are compound. Slab, file, anon; it's all
> compound. I propose that we _only_ keep compound memory on the PCP list.
> Freeing non-compound multi-page memory can either convert it into compound
> pages before being placed on the PCP list or just hand the memory back
> to the buddy allocator. Non-compound multi-page allocations can either
> go straight to buddy or grab from the PCP list and undo the compound
> nature of the pages.
>
> I think this could be a huge saving. Consider allocating an order-9 PMD
> sized THP. Today we initialise compound_head in each of the 511 tail
> pages. Since a page is 64 bytes, we touch 32kB of memory! That's 2/3 of
> my CPU's L1 D$, so it's just pushed out a good chunk of my working set.
> And it's all dirty, so it has to get written back.
I've been exploring the idea of keeping compound pages as compound pages
while they reside in PCP lists.
My experimental results so far suggest that reducing the overhead of
destroying and preparing compound pages does not have a substantial
impact on pft and kernbench.
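For reference, the overhead in question is essentially the loop over the
tail pages in prep_compound_page() (plus the matching teardown when the
compound page is destroyed). A simplified sketch of that logic, not
verbatim kernel code:

/*
 * Simplified sketch of what prep_compound_page() does for an order-n
 * page; not verbatim kernel code.
 */
static void prep_compound_page_sketch(struct page *head, unsigned int order)
{
	unsigned long nr_pages = 1UL << order;
	unsigned long i;

	__SetPageHead(head);
	for (i = 1; i < nr_pages; i++) {
		struct page *p = head + i;

		/*
		 * Every tail page gets compound_head pointing at the
		 * head page. For order-9 that is 511 struct pages,
		 * i.e. ~32kB of memory touched and dirtied.
		 */
		set_compound_head(p, head);
	}
}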
For the record, let me share what I did and the results I obtained.
# Implementation
My toy implementation is here:
https://gitlab.com/hyeyoo/linux/-/tree/pcp-keep-compound
In this implementation, I modified free_pages_prepare() to skip
destroying compound pages when freeing to the PCP. If the kernel fails to
acquire the PCP lock, it destroys the compound page and frees it to the
buddy instead. Because order>0 pages in the PCP are compound pages,
rmqueue() does not prepare compound pages when they are already compound.
rmqueue_bulk() prepares compound pages when grabbing pages from the
buddy, and free_pcppages_bulk() destroys compound pages before giving
them back to the buddy.
Non-compound multi-page allocations simply bypass the PCP.
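In pseudocode, the free-side decision looks roughly like this (heavily
simplified; the helper names are illustrative, not the actual patch):

/*
 * Heavily simplified sketch of the experimental free path.
 * Helper names are illustrative only.
 */
static void free_unref_compound_page(struct page *page, unsigned int order)
{
	struct per_cpu_pages *pcp;

	/* Try to keep the page compound while it sits on the PCP. */
	pcp = try_lock_this_cpu_pcp();			/* illustrative */
	if (pcp) {
		pcp_add_compound(pcp, page, order);	/* still compound */
		unlock_this_cpu_pcp(pcp);
		return;
	}

	/*
	 * Could not take the PCP lock: fall back to the buddy, which
	 * requires tearing down the compound state first.
	 */
	destroy_compound_state(page, order);		/* illustrative */
	free_to_buddy(page, order);			/* illustrative */
}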
# Evaluation
My experiments were done on a Ryzen 5900X (12 cores, 24 threads) box with
128GiB of DDR4-3200 memory. In the summaries below, '2MB-32kB-16kB' means
the kernel first tries to allocate 2MB and then falls back to 32kB/16kB.
In each table, the first column (with 0.00% deltas) is the baseline and
the second is the patched kernel, with the relative change in parentheses.
[PFT, 2MB-32kB-16kB]
# timings
Amean system-1 1.51 ( 0.00%) 1.50 ( 0.20%)
Amean system-3 4.20 ( 0.00%) 4.19 ( 0.39%)
Amean system-5 7.23 ( 0.00%) 7.20 ( 0.48%)
Amean system-7 8.91 ( 0.00%) 8.97 ( -0.67%)
Amean system-12 17.80 ( 0.00%) 17.75 ( 0.25%)
Amean system-18 26.26 ( 0.00%) 26.14 ( 0.47%)
Amean system-24 34.88 ( 0.00%) 34.97 ( -0.28%)
Amean elapsed-1 1.55 ( 0.00%) 1.55 ( -0.00%)
Amean elapsed-3 1.42 ( 0.00%) 1.42 ( 0.23%)
Amean elapsed-5 1.48 ( 0.00%) 1.48 ( 0.11%)
Amean elapsed-7 1.47 ( 0.00%) 1.47 ( -0.11%)
Amean elapsed-12 1.50 ( 0.00%) 1.49 ( 0.18%)
Amean elapsed-18 1.53 ( 0.00%) 1.53 ( 0.15%)
Amean elapsed-24 1.56 ( 0.00%) 1.56 ( 0.04%)
# faults
Hmean faults/cpu-1 4268815.8976 ( 0.00%) 4270575.7783 ( 0.04%)
Hmean faults/cpu-3 1547723.5433 ( 0.00%) 1551782.9058 ( 0.26%)
Hmean faults/cpu-5 895050.1683 ( 0.00%) 894949.2879 ( -0.01%)
Hmean faults/cpu-7 726379.5638 ( 0.00%) 719439.1475 ( -0.96%)
Hmean faults/cpu-12 368548.1793 ( 0.00%) 369313.2284 ( 0.21%)
Hmean faults/cpu-18 246883.6901 ( 0.00%) 248085.6269 ( 0.49%)
Hmean faults/cpu-24 182289.8542 ( 0.00%) 182349.3383 ( 0.03%)
Hmean faults/sec-1 4263408.8848 ( 0.00%) 4266020.8111 ( 0.06%)
Hmean faults/sec-3 4631463.5864 ( 0.00%) 4642967.3237 ( 0.25%)
Hmean faults/sec-5 4440309.3622 ( 0.00%) 4445610.5062 ( 0.12%)
Hmean faults/sec-7 4482600.6360 ( 0.00%) 4477462.5227 ( -0.11%)
Hmean faults/sec-12 4402147.6818 ( 0.00%) 4410717.5042 ( 0.19%)
Hmean faults/sec-18 4301172.0933 ( 0.00%) 4308668.3804 ( 0.17%)
Hmean faults/sec-24 4234656.1409 ( 0.00%) 4237997.7612 ( 0.08%)
[PFT, 32kB-16kB]
# timings
Amean system-1 2.34 ( 0.00%) 2.38 ( -1.93%)
Amean system-3 4.14 ( 0.00%) 4.15 ( -0.10%)
Amean system-5 7.11 ( 0.00%) 7.10 ( 0.16%)
Amean system-7 9.32 ( 0.00%) 9.28 ( 0.42%)
Amean system-12 17.92 ( 0.00%) 17.96 ( -0.22%)
Amean system-18 25.86 ( 0.00%) 25.62 ( 0.93%)
Amean system-24 35.67 ( 0.00%) 35.66 ( 0.03%)
Amean elapsed-1 2.47 ( 0.00%) 2.51 * -1.95%*
Amean elapsed-3 1.42 ( 0.00%) 1.42 ( -0.07%)
Amean elapsed-5 1.46 ( 0.00%) 1.45 ( 0.32%)
Amean elapsed-7 1.47 ( 0.00%) 1.46 ( 0.30%)
Amean elapsed-12 1.51 ( 0.00%) 1.51 ( -0.31%)
Amean elapsed-18 1.53 ( 0.00%) 1.52 ( 0.22%)
Amean elapsed-24 1.52 ( 0.00%) 1.52 ( 0.09%)
# faults
Hmean faults/cpu-1 2674791.9210 ( 0.00%) 2622289.9694 * -1.96%*
Hmean faults/cpu-3 1548744.0561 ( 0.00%) 1547049.2525 ( -0.11%)
Hmean faults/cpu-5 909932.0813 ( 0.00%) 911274.4409 ( 0.15%)
Hmean faults/cpu-7 697042.1240 ( 0.00%) 700115.5680 ( 0.44%)
Hmean faults/cpu-12 365314.7268 ( 0.00%) 364452.9596 ( -0.24%)
Hmean faults/cpu-18 251711.1949 ( 0.00%) 253133.5633 ( 0.57%)
Hmean faults/cpu-24 183212.0204 ( 0.00%) 183350.4502 ( 0.08%)
Hmean faults/sec-1 2673435.8060 ( 0.00%) 2621259.7412 * -1.95%*
Hmean faults/sec-3 4637478.9272 ( 0.00%) 4634486.4260 ( -0.06%)
Hmean faults/sec-5 4531908.1729 ( 0.00%) 4541788.9414 ( 0.22%)
Hmean faults/sec-7 4482779.5699 ( 0.00%) 4499528.9948 ( 0.37%)
Hmean faults/sec-12 4365201.3350 ( 0.00%) 4353874.4404 ( -0.26%)
Hmean faults/sec-18 4314991.0014 ( 0.00%) 4321977.0563 ( 0.16%)
Hmean faults/sec-24 4344249.9657 ( 0.00%) 4345263.1455 ( 0.02%)
[PFT, 16kB]
# timings
Amean system-1 3.13 ( 0.00%) 3.22 * -2.84%*
Amean system-3 4.31 ( 0.00%) 4.38 ( -1.70%)
Amean system-5 6.93 ( 0.00%) 6.95 ( -0.21%)
Amean system-7 9.40 ( 0.00%) 9.47 ( -0.66%)
Amean system-12 17.72 ( 0.00%) 17.75 ( -0.15%)
Amean system-18 26.19 ( 0.00%) 26.12 ( 0.24%)
Amean system-24 35.41 ( 0.00%) 35.31 ( 0.29%)
Amean elapsed-1 3.34 ( 0.00%) 3.43 * -2.79%*
Amean elapsed-3 1.50 ( 0.00%) 1.52 * -1.67%*
Amean elapsed-5 1.44 ( 0.00%) 1.44 ( -0.09%)
Amean elapsed-7 1.46 ( 0.00%) 1.46 ( -0.43%)
Amean elapsed-12 1.50 ( 0.00%) 1.50 ( -0.04%)
Amean elapsed-18 1.50 ( 0.00%) 1.49 ( 0.18%)
Amean elapsed-24 1.51 ( 0.00%) 1.50 ( 0.58%)
# faults
Hmean faults/cpu-1 1975802.9271 ( 0.00%) 1921547.2378 * -2.75%*
Hmean faults/cpu-3 1468850.1816 ( 0.00%) 1444987.8129 * -1.62%*
Hmean faults/cpu-5 921763.7641 ( 0.00%) 920183.2144 ( -0.17%)
Hmean faults/cpu-7 686678.6801 ( 0.00%) 681824.6507 ( -0.71%)
Hmean faults/cpu-12 367972.3092 ( 0.00%) 367485.9867 ( -0.13%)
Hmean faults/cpu-18 248991.2074 ( 0.00%) 249470.5429 ( 0.19%)
Hmean faults/cpu-24 184426.2136 ( 0.00%) 184968.3682 ( 0.29%)
Hmean faults/sec-1 1975110.1317 ( 0.00%) 1920789.4502 * -2.75%*
Hmean faults/sec-3 4400629.2802 ( 0.00%) 4327578.7473 * -1.66%*
Hmean faults/sec-5 4594228.5709 ( 0.00%) 4589647.3446 ( -0.10%)
Hmean faults/sec-7 4524451.1451 ( 0.00%) 4502414.9629 ( -0.49%)
Hmean faults/sec-12 4391814.1854 ( 0.00%) 4392709.6380 ( 0.02%)
Hmean faults/sec-18 4406851.3807 ( 0.00%) 4409517.3005 ( 0.06%)
Hmean faults/sec-24 4376471.9465 ( 0.00%) 4395938.4025 ( 0.44%)
[kernbench, 2MB-32kB-16kB THP]
Amean user-24 16236.12 ( 0.00%) 16402.41 ( -1.02%)
Amean syst-24 1289.28 ( 0.00%) 1329.39 * -3.11%*
Amean elsp-24 763.75 ( 0.00%) 775.03 ( -1.48%)
[kernbench, 32kB-16kB THP]
Amean user-24 16334.94 ( 0.00%) 16314.30 ( 0.13%)
Amean syst-24 1565.50 ( 0.00%) 1609.74 * -2.83%*
Amean elsp-24 779.84 ( 0.00%) 780.55 ( -0.09%)
[kernbench, 16kB]
Amean user-24 16205.47 ( 0.00%) 16282.73 * -0.48%*
Amean syst-24 1784.03 ( 0.00%) 1837.13 * -2.98%*
Amean elsp-24 787.37 ( 0.00%) 791.00 * -0.46%*
I suspected that making the zone->lock section in free_pcppages_bulk() a
little finer-grained in my implementation had a negative impact, so I
changed it back to being coarse-grained and ran the experiments again,
but it was still not a clear win on kernbench and pft.
So... yeah. The results for this idea are not that encouraging.
Based on the results, my theories are:
1) The overhead of preparing compound pages is too small to show a
   measurable difference (prep_compound_page() accounts for less than
   0.5% of the total profile on pft).
2) By the time the kernel prepares compound pages, a large chunk of the
   CPU cache has already been dirtied (e.g., when allocating a 2MB anon
   THP, the kernel touches the whole 2MB of memory anyway), so dirtying
   ~1.56% more cache lines (512 struct pages * 64 bytes = 32kB, against
   the 2MB of data) does not have a huge impact on performance.
--
Cheers,
Harry