Message-ID: <3f4285cf-adb8-4fc5-ad18-c3a0d6de4db0@arm.com>
Date: Thu, 11 Dec 2025 21:08:07 +0530
Subject: Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator
From: Dev Jain <dev.jain@arm.com>
To: Ryan Roberts, "Vishal Moola (Oracle)"
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Uladzislau Rezki, Andrew Morton
References: <20251021194455.33351-2-vishal.moola@gmail.com>
 <66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com>
 <0fa1c315-70ef-46f3-95ec-feb3a75a10be@arm.com>
In-Reply-To: <0fa1c315-70ef-46f3-95ec-feb3a75a10be@arm.com>

On 11/12/25 9:05 pm, Dev Jain wrote:
>
> On 11/12/25 8:58 pm, Ryan Roberts wrote:
>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>> Hi Vishal,
>>>>
>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>> smaller number of times.
>>>>>
>>>>> We still split the large order pages down to order-0 as the rest of the
>>>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>>>> allocator and fallback path in case of order-0 pages or failure.
>>>>>
>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>
>>>>> 1000 2mb allocations:
>>>>>     [Baseline]                [This patch]
>>>>>     real    46.310s           real    0m34.582
>>>>>     user    0.001s            user    0.006s
>>>>>     sys     46.058s           sys     0m34.365s
>>>>>
>>>>> 10000 200kb allocations:
>>>>>     [Baseline]                [This patch]
>>>>>     real    56.104s           real    0m43.696
>>>>>     user    0.001s            user    0.003s
>>>>>     sys     55.375s           sys     0m42.995s
>>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>>>> bisect is pointing to this patch.
>>> Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
>>> are expected for how the test module is currently written.
>> Hmm... simplistically, I'd say that either the tests are bad, in which case
>> they should be deleted, or they are good, in which case we shouldn't ignore
>> the regressions. Having tests that we learn to ignore is the worst of both
>> worlds.
>
> AFAICR the test does some million-odd iterations by default, which is the
> real problem. On my RFC [1] I notice that reducing the iterations reduces
> the regression - till some multiple of ten thousand iterations, the
> regression is zero. Doing this alloc->free a million times messes up the
> buddy badly.
>
> [1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/

So this line:

__param(int, test_loop_count, 1000000,
	"Set test loop counter");

We should just change it to 20k or something and that should resolve it.
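i.e. something like this (completely untested sketch, assuming that __param
still lives in lib/test_vmalloc.c exactly as quoted above):

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ ... @@
-__param(int, test_loop_count, 1000000,
-	"Set test loop counter");
+/* A million alloc/free iterations just churns the buddy; ~20k is enough signal. */
+__param(int, test_loop_count, 20000,
+	"Set test loop counter");

Or, leaving the default alone, whoever runs the regression suite could override
it at load time, e.g. "modprobe test_vmalloc test_loop_count=20000", since it is
an ordinary module parameter.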
>
>> But I see your point about the allocation pattern not being very realistic.
>>
>>>> The tests are all originally from the vmalloc_test module. Note that (R)
>>>> indicates a statistically significant regression and (I) indicates a
>>>> statistically significant improvement.
>>>>
>>>> p is the number of pages in the allocation, h is huge. So it looks like the
>>>> regressions are all coming from the non-huge case, where we want to split to
>>>> order-0.
>>>>
>>>> +-----------------+----------------------------------------------------------+------------+----------------------+
>>>> | Benchmark       | Result Class                                             |     6-18-0 | 6-18-0-gc2f2b01b74be |
>>>> +=================+==========================================================+============+======================+
>>>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 |          (R) -42.20% |
>>>> |                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 |               -0.02% |
>>>> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 |          (R) -23.43% |
>>>> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 |          (R) -23.66% |
>>>> |                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 |               -1.05% |
>>>> |                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 |          (R) -23.99% |
>>>> |                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 |            (I) 2.56% |
>>>> |                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 |          (R) -23.28% |
>>>> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 |            (I) 3.43% |
>>>> |                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 |          (R) -23.86% |
>>>> |                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 |           (R) -2.97% |
>>>> |                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 |            (I) 4.95% |
>>>> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 |               -0.65% |
>>>> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 |               -1.58% |
>>>> |                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 |          (R) -25.35% |
>>>> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 |                0.57% |
>>>> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 |          (R) -47.63% |
>>>> |                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 |          (R) -27.88% |
>>>> |                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 |               -0.22% |
>>>> +-----------------+----------------------------------------------------------+------------+----------------------+
>>>>
>>>> I have a couple of thoughts from looking at the patch:
>>>>
>>>>   - Perhaps split_page() is the bulk of the cost? Previously for this case we
>>>>     were allocating order-0 so there was no split to do. For h=1, split would
>>>>     have already been called, so that would explain why there is no regression
>>>>     for that case?
>>> For h=1, this patch shouldn't change anything (as long as nr_pages <
>>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
>>> in those cases.
>> arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages >= 16
>> we can take the huge path.
>>
>>>>   - I guess we are bypassing the pcpu cache? Could this be having an effect?
>>>>     Dev (cc'ed) did some similar investigation a while back and saw increased
>>>>     vmalloc latencies when bypassing the pcpu cache.
>>> I'd say this is more a case of this test module targeting the pcpu
>>> cache. The module allocates then frees one at a time, which promotes
>>> reusing pcpu pages. [1] has some numbers after modifying the test such
>>> that all the allocations are made before freeing any.
>> OK fair enough.
>>
>> We are seeing a bunch of other regressions in higher level benchmarks too, but
>> we haven't yet concluded what's causing those. I'll report back if this patch
>> looks connected.
>>
>> Thanks,
>> Ryan
>>
>>
>>>>   - Philosophically, is allocating physically contiguous memory when it is not
>>>>     strictly needed the right thing to do? Large physically contiguous blocks
>>>>     are a scarce resource, so we don't want to waste them. Although I guess it
>>>>     could be argued that this actually preserves the contiguous blocks because
>>>>     the lifetime of all the pages is tied together. Anyway, I doubt this is the
>>> This was the primary incentive for this patch :)
>>>
>>>>     reason for the slowdown, since those benchmarks are not under memory
>>>>     pressure.
>>>>
>>>> Anyway, it would be good to resolve the performance regressions if we can.
>>> Imo, the appropriate way to address these is to modify the test module
>>> as seen in [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
>>
>
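P.S. for anyone skimming the thread: the pattern under discussion is roughly
the below. This is my own paraphrase with a made-up helper name, not Vishal's
actual hunk - one high-order buddy allocation that gets split back down to
order-0 pages, instead of asking the pcp/bulk path for order-0 pages directly:

/* Sketch only - not the real mm/vmalloc.c code. */
#include <linux/gfp.h>
#include <linux/mm.h>

static unsigned int sketch_alloc_split(gfp_t gfp, unsigned int order,
				       struct page **pages)
{
	struct page *page;
	unsigned int i;

	/* One buddy allocation for 2^order pages (non-compound, so it can be split). */
	page = alloc_pages(gfp | __GFP_NOWARN, order);
	if (!page)
		return 0;	/* caller falls back to the existing order-0/bulk path */

	/* Break the order-N block into 2^order independent order-0 pages. */
	split_page(page, order);

	for (i = 0; i < (1U << order); i++)
		pages[i] = page + i;

	return 1U << order;
}

If split_page() plus skipping the pcp lists is where the time goes, that would
line up with the h:0 rows above regressing while the h:1 rows stay flat, since
the huge path was already paying for the split before this patch.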