Message-ID: <610eb535-e2cc-15cf-d776-5486aba118ff@sra.uni-hannover.de>
Date: Thu, 9 Feb 2023 11:34:39 +0100
Subject: Re: [PATCH] mm: reduce lock contention of pcp buffer refill
From: Alexander Halbuer <halbuer@sra.uni-hannover.de>
To: Vlastimil Babka, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Mel Gorman
References: <20230201162549.68384-1-halbuer@sra.uni-hannover.de>
 <20230202152501.297639031e96baad35cdab17@linux-foundation.org>
 <68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de>
 <8dc0a48f-f29a-27de-0ced-d39a920b2afd@suse.cz>
In-Reply-To: <8dc0a48f-f29a-27de-0ced-d39a920b2afd@suse.cz>
On 2/8/23 16:11, Vlastimil Babka wrote:
> On 2/7/23 17:11, Alexander Halbuer wrote:
>> On 2/3/23 00:25, Andrew Morton wrote:
>>> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer wrote:
>>>
>>>> The `rmqueue_bulk` function batches the allocation of multiple elements
>>>> to refill the per-CPU buffers into a single hold of the zone lock. Each
>>>> element is allocated and checked using the `check_pcp_refill` function.
>>>> The check touches every related struct page, which is especially
>>>> expensive for higher order allocations (huge pages).
>>>> This patch reduces the time holding the lock by moving the check out of
>>>> the critical section, similar to the `rmqueue_buddy` function which
>>>> allocates a single element. Measurements of parallel allocation-heavy
>>>> workloads show a reduction of the average huge page allocation latency
>>>> of 50 percent for two cores and nearly 90 percent for 24 cores.
>>>
>>> Sounds nice.
>>>
>>> Were you able to test how much benefit we get by simply removing the
>>> check_new_pages() call from rmqueue_bulk()?
>>
>> I did some further investigations and measurements to quantify potential
>> performance gains. Benchmarks ran on a machine with 24 physical cores
>> fixed at 2.1 GHz. The results show significant performance gains with the
>> patch applied in parallel scenarios. Eliminating the check reduces
>> allocation latency further, especially for low core counts.
>
> Are the benchmarks done via kernel module calling bulk allocation, or from
> userspace?

The benchmarks are done via a kernel module that uses the alloc_pages_node()
function to allocate from a specific NUMA node. (The test machine is a
dual-socket system with two Intel Xeon Gold 6252 CPUs, 24 physical cores
each; benchmark threads are pinned to the first CPU to prevent additional
NUMA effects.) The module was originally developed as part of a discussion
about different allocator designs, to get real numbers from an actual buddy
allocator implementation. Measurements include different allocation
scenarios (bulk, repeat, random) for different allocation sizes (orders)
and degrees of parallelism. The parallel bulk allocation benchmark showed
an outlier for huge page allocations (order 9) that was much worse than
order 8 and even order 10, despite the lack of a pcp buffer for those
sizes. This outlier was the origin of my further investigations.
The benchmark module internally uses multiple threads that independently
perform a specified number of allocations with a loop around the mentioned
alloc_pages_node() function. A barrier synchronizes all those threads at
the beginning. The values shown in the tables are the average durations of
all threads divided by the number of allocations, to break the result down
to a single allocation.

>> The following tables show the average allocation latencies for huge pages
>> and normal pages for three different configurations:
>> the unpatched kernel, the patched kernel, and an additional version
>> without the check.
>>
>> Huge pages
>> +-------+---------+-------+----------+---------+----------+
>> | Cores | Default | Patch |    Patch | NoCheck |  NoCheck |
>> |       |    (ns) |  (ns) |     Diff |    (ns) |     Diff |
>> +-------+---------+-------+----------+---------+----------+
>> |     1 |     127 |   124 |  (-2.4%) |     118 |  (-7.1%) |
>> |     2 |     140 |   140 |  (-0.0%) |     134 |  (-4.3%) |
>> |     3 |     143 |   142 |  (-0.7%) |     134 |  (-6.3%) |
>> |     4 |     178 |   159 | (-10.7%) |     156 | (-12.4%) |
>> |     6 |     269 |   239 | (-11.2%) |     237 | (-11.9%) |
>> |     8 |     363 |   321 | (-11.6%) |     319 | (-12.1%) |
>> |    10 |     454 |   409 |  (-9.9%) |     409 |  (-9.9%) |
>> |    12 |     545 |   494 |  (-9.4%) |     488 | (-10.5%) |
>> |    14 |     639 |   578 |  (-9.5%) |     574 | (-10.2%) |
>> |    16 |     735 |   660 | (-10.2%) |     653 | (-11.2%) |
>> |    20 |     915 |   826 |  (-9.7%) |     815 | (-10.9%) |
>> |    24 |    1105 |   992 | (-10.2%) |     982 | (-11.1%) |
>> +-------+---------+-------+----------+---------+----------+
>>
>> Normal pages
>> +-------+---------+-------+----------+---------+----------+
>> | Cores | Default | Patch |    Patch | NoCheck |  NoCheck |
>> |       |    (ns) |  (ns) |     Diff |    (ns) |     Diff |
>> +-------+---------+-------+----------+---------+----------+
>> |     1 |    2790 |  2767 |  (-0.8%) |     171 | (-93.9%) |
>> |     2 |    6685 |  3484 | (-47.9%) |     519 | (-92.2%) |
>> |     3 |   10501 |  3599 | (-65.7%) |     855 | (-91.9%) |
>> |     4 |   14264 |  3635 | (-74.5%) |    1139 | (-92.0%) |
>> |     6 |   21800 |  3551 | (-83.7%) |    1713 | (-92.1%) |
>> |     8 |   29563 |  3570 | (-87.9%) |    2268 | (-92.3%) |
>> |    10 |   37210 |  3845 | (-89.7%) |    2872 | (-92.3%) |
>> |    12 |   44780 |  4452 | (-90.1%) |    3417 | (-92.4%) |
>> |    14 |   52576 |  5100 | (-90.3%) |    4020 | (-92.4%) |
>> |    16 |   60118 |  5785 | (-90.4%) |    4604 | (-92.3%) |
>> |    20 |   75037 |  7270 | (-90.3%) |    6486 | (-91.4%) |
>> |    24 |   90226 |  8712 | (-90.3%) |    7061 | (-92.2%) |
>> +-------+---------+-------+----------+---------+----------+
>
> Nice improvements. I assume the table headings are switched? Why would
> Normal be more expensive than Huge?

Yes, you are right, I messed up the table headings.

>>> Vlastimil, I find this quite confusing:
>>>
>>> #ifdef CONFIG_DEBUG_VM
>>> /*
>>>  * With DEBUG_VM enabled, order-0 pages are checked for expected state
>>>  * when being allocated from pcp lists. With debug_pagealloc also
>>>  * enabled, they are also checked when pcp lists are refilled from the
>>>  * free lists.
>>>  */
>>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>>> {
>>> 	if (debug_pagealloc_enabled_static())
>>> 		return check_new_pages(page, order);
>>> 	else
>>> 		return false;
>>> }
>>>
>>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>>> {
>>> 	return check_new_pages(page, order);
>>> }
>>> #else
>>> /*
>>>  * With DEBUG_VM disabled, free order-0 pages are checked for expected
>>>  * state when pcp lists are being refilled from the free lists. With
>>>  * debug_pagealloc enabled, they are also checked when being allocated
>>>  * from the pcp lists.
>>>  */
>>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>>> {
>>> 	return check_new_pages(page, order);
>>> }
>>>
>>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>>> {
>>> 	if (debug_pagealloc_enabled_static())
>>> 		return check_new_pages(page, order);
>>> 	else
>>> 		return false;
>>> }
>>> #endif /* CONFIG_DEBUG_VM */
>>>
>>> and the 4462b32c9285b5 changelog is a struggle to follow.
>>>
>>> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
>>> debug_pagealloc_enabled is false?
>>>
>>> Anyway, these checks sound quite costly so let's revisit their
>>> desirability?
>>>