linux-mm.kvack.org archive mirror
From: Dev Jain <dev.jain@arm.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	"Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Uladzislau Rezki <urezki@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator
Date: Thu, 11 Dec 2025 21:08:07 +0530	[thread overview]
Message-ID: <3f4285cf-adb8-4fc5-ad18-c3a0d6de4db0@arm.com> (raw)
In-Reply-To: <0fa1c315-70ef-46f3-95ec-feb3a75a10be@arm.com>


On 11/12/25 9:05 pm, Dev Jain wrote:
>
> On 11/12/25 8:58 pm, Ryan Roberts wrote:
>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>> Hi Vishal,
>>>>
>>>>
>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>> smaller number of times.
>>>>>
>>>>> We still split the large order pages down to order-0 as the rest of
>>>>> the vmalloc code (and some callers) depend on it. We still defer to
>>>>> the bulk allocator and fallback path in case of order-0 pages or
>>>>> failure.
>>>>>
>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>
>>>>> 1000 2MB allocations:
>>>>>              [Baseline]      [This patch]
>>>>>     real      46.310s         34.582s
>>>>>     user       0.001s          0.006s
>>>>>     sys       46.058s         34.365s
>>>>>
>>>>> 10000 200KB allocations:
>>>>>              [Baseline]      [This patch]
>>>>>     real      56.104s         43.696s
>>>>>     user       0.001s          0.003s
>>>>>     sys       55.375s         42.995s
>>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for
>>>> which bisect is pointing to this patch.
>>> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
>>> are expected for how the test module is currently written.
>> Hmm... simplistically, I'd say that either the tests are bad, in which
>> case they should be deleted, or they are good, in which case we
>> shouldn't ignore the regressions. Having tests that we learn to ignore
>> is the worst of both worlds.
>
> AFAICR the test does some million-odd iterations by default, which is
> the real problem. On my RFC [1] I noticed that reducing the iteration
> count reduces the regression - at some multiple of ten thousand
> iterations the regression drops to zero. Doing this alloc->free cycle a
> million times messes up the buddy badly.
>
> [1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/


So this line:

__param(int, test_loop_count, 1000000,
         "Set test loop counter");

We should just change it to 20k or something and that should resolve it.
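
A rough sketch of the change (20k is just a strawman default):

__param(int, test_loop_count, 20000,
        "Set test loop counter");

IIRC test_loop_count is already exposed as a module parameter, so runs could
also override it at load time (modprobe test_vmalloc test_loop_count=20000)
without touching the default.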


>
>>
>> But I see your point about the allocation pattern not being very 
>> realistic.
>>
>>>> The tests are all originally from the vmalloc_test module. Note that
>>>> (R) indicates a statistically significant regression and (I) indicates
>>>> a statistically significant improvement.
>>>>
>>>> p is the number of pages in the allocation, h is huge. So it looks like
>>>> the regressions are all coming from the non-huge case, where we want to
>>>> split to order-0.
>>>>
>>>> +-----------------+-----------------------------------------------------------+------------+----------------------+
>>>> | Benchmark       | Result Class                                              |     6-18-0 | 6-18-0-gc2f2b01b74be |
>>>> +=================+===========================================================+============+======================+
>>>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)           |  514126.58 |          (R) -42.20% |
>>>> |                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)            |  320458.33 |               -0.02% |
>>>> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)            |  399680.33 |          (R) -23.43% |
>>>> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)           |  788723.25 |          (R) -23.66% |
>>>> |                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)           |  979839.58 |               -1.05% |
>>>> |                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)           |  481454.58 |          (R) -23.99% |
>>>> |                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)           |  615924.00 |            (I) 2.56% |
>>>> |                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)          | 1799224.08 |          (R) -23.28% |
>>>> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)          | 2313859.25 |            (I) 3.43% |
>>>> |                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)          | 3541904.75 |          (R) -23.86% |
>>>> |                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)          | 3597577.25 |           (R) -2.97% |
>>>> |                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)            |  487021.83 |            (I) 4.95% |
>>>> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec)  |  344466.33 |               -0.65% |
>>>> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec)  |  342484.25 |               -1.58% |
>>>> |                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)      | 4034901.17 |          (R) -25.35% |
>>>> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)                |  195973.42 |                0.57% |
>>>> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)   |  643489.33 |          (R) -47.63% |
>>>> |                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)         | 2029261.33 |          (R) -27.88% |
>>>> |                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)                |   83557.08 |               -0.22% |
>>>> +-----------------+-----------------------------------------------------------+------------+----------------------+
>>>>
>>>>
>>>> I have a couple of thoughts from looking at the patch:
>>>>
>>>>   - Perhaps split_page() is the bulk of the cost? Previously for this
>>>>     case we were allocating order-0 so there was no split to do. For
>>>>     h=1, split would have already been called so that would explain why
>>>>     no regression for that case?
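>>>>
>>>>     Roughly the pattern I have in mind - a sketch of the new path, not
>>>>     the exact code from the patch:
>>>>
>>>>         /* one high-order request, then split so callers still see order-0 pages */
>>>>         page = alloc_pages_node(nid, gfp, order);
>>>>         if (page)
>>>>                 split_page(page, order);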
>>> For h=1, this patch shouldn't change anything (as long as nr_pages <
>>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see
>>> regressions in those cases.
>> arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages >= 16
>> we can take the huge path.
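>>
>> (Roughly, from memory - arm64's arch_vmap_pte_supported_shift() in
>> arch/arm64/include/asm/vmalloc.h is along the lines of:
>>
>>     static inline int arch_vmap_pte_supported_shift(unsigned long size)
>>     {
>>             return size >= CONT_PTE_SIZE ? CONT_PTE_SHIFT : PAGE_SHIFT;
>>     }
>>
>> so with 4K base pages, anything of 64K or more is eligible for contiguous
>> PTE mappings.)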
>>
>>>>   - I guess we are bypassing the pcpu cache? Could this be having an
>>>>     effect? Dev (cc'ed) did some similar investigation a while back and
>>>>     saw increased vmalloc latencies when bypassing pcpu cache.
>>> I'd say this is more a case of this test module targeting the pcpu
>>> cache. The module allocates then frees one at a time, which promotes
>>> reusing pcpu pages. [1] has some numbers after modifying the test such
>>> that all the allocations are made before freeing any.
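>>>
>>> Illustrative only, not the actual module code - the current pattern is
>>> essentially
>>>
>>>     for (i = 0; i < test_loop_count; i++) {
>>>             ptr = vmalloc(size);
>>>             vfree(ptr);
>>>     }
>>>
>>> whereas the modified test in [1] allocates everything up front and only
>>> then frees, so just-freed pages can't be recycled straight back out of
>>> the pcpu lists:
>>>
>>>     for (i = 0; i < n; i++)
>>>             ptr[i] = vmalloc(size);
>>>     for (i = 0; i < n; i++)
>>>             vfree(ptr[i]);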
>> OK fair enough.
>>
>> We are seeing a bunch of other regressions in higher level benchmarks
>> too, but haven't yet concluded what's causing those. I'll report back if
>> this patch looks connected.
>>
>> Thanks,
>> Ryan
>>
>>
>>>>   - Philosophically, is allocating physically contiguous memory when
>>>>     it is not strictly needed the right thing to do? Large physically
>>>>     contiguous blocks are a scarce resource so we don't want to waste
>>>>     them. Although I guess it could be argued that this actually
>>>>     preserves the contiguous blocks because the lifetime of all the
>>>>     pages is tied together.
>>> This was the primary incentive for this patch :)
>>>
>>>>     Anyway, I doubt this is the reason for the slow down, since those
>>>>     benchmarks are not under memory pressure.
>>>>
>>>> Anyway, it would be good to resolve the performance regressions if we
>>>> can.
>>> Imo, the appropriate way to address these is to modify the test module
>>> as seen in [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
>>
>


Thread overview: 14+ messages
2025-10-21 19:44 Vishal Moola (Oracle)
2025-10-21 21:24 ` Andrew Morton
2025-10-22 14:33   ` Matthew Wilcox
2025-10-22 17:50 ` Uladzislau Rezki
2025-12-10 13:21 ` Ryan Roberts
2025-12-10 22:28   ` Vishal Moola (Oracle)
2025-12-11 15:28     ` Ryan Roberts
2025-12-11 15:35       ` Dev Jain
2025-12-11 15:38         ` Dev Jain [this message]
2025-12-11 15:39       ` Uladzislau Rezki
2025-12-11 15:43         ` Dev Jain
2025-12-11 16:24           ` Uladzislau Rezki
2025-12-12  3:55             ` Dev Jain
2025-12-11 20:41       ` Andrew Morton
