Subject: Re: [PATCH] mm: reduce lock contention of pcp buffer refill
From: Alexander Halbuer <halbuer@sra.uni-hannover.de>
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Vlastimil Babka, Mel Gorman
Date: Tue, 7 Feb 2023 17:11:00 +0100
Message-ID: <68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de>
In-Reply-To: <20230202152501.297639031e96baad35cdab17@linux-foundation.org>
References: <20230201162549.68384-1-halbuer@sra.uni-hannover.de>
 <20230202152501.297639031e96baad35cdab17@linux-foundation.org>
On 2/3/23 00:25, Andrew Morton wrote:
> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer wrote:
>
>> The `rmqueue_bulk` function batches the allocation of multiple elements
>> to refill the per-CPU buffers into a single hold of the zone lock. Each
>> element is allocated and checked using the `check_pcp_refill` function.
>> The check touches every related struct page, which is especially
>> expensive for higher-order allocations (huge pages). This patch reduces
>> the time holding the lock by moving the check out of the critical
>> section, similar to the `rmqueue_buddy` function, which allocates a
>> single element. Measurements of parallel allocation-heavy workloads
>> show a reduction of the average huge page allocation latency by
>> 50 percent for two cores and nearly 90 percent for 24 cores.
>
> Sounds nice.
>
> Were you able to test how much benefit we get by simply removing the
> check_new_pages() call from rmqueue_bulk()?

I did some further investigations and measurements to quantify the
potential performance gains.

The benchmarks ran on a machine with 24 physical cores fixed at 2.1 GHz.
The results show significant performance gains from the patch in parallel
scenarios. Eliminating the check entirely reduces allocation latency
further, especially for low core counts.
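For context, the patch boils down to the pattern below. This is a
simplified sketch rather than the actual hunk: the real code remembers
the previous tail of the list so it only re-checks the pages it just
added, and it keeps the CMA and vmstat accounting that is omitted here.

static int rmqueue_bulk(struct zone *zone, unsigned int order,
                        unsigned long count, struct list_head *list,
                        int migratetype, unsigned int alloc_flags)
{
        unsigned long flags;
        struct page *page, *tmp;
        int i, allocated = 0;

        spin_lock_irqsave(&zone->lock, flags);
        for (i = 0; i < count; ++i) {
                page = __rmqueue(zone, order, migratetype, alloc_flags);
                if (unlikely(page == NULL))
                        break;
                /* No per-page check under the lock anymore, just queue it. */
                list_add_tail(&page->pcp_list, list);
                allocated++;
        }
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
        spin_unlock_irqrestore(&zone->lock, flags);

        /*
         * Check the batch only after dropping the zone lock and drop any
         * bad page from the list, analogous to what rmqueue_buddy() does
         * for a single page.
         */
        list_for_each_entry_safe(page, tmp, list, pcp_list) {
                if (unlikely(check_pcp_refill(page, order))) {
                        list_del(&page->pcp_list);
                        allocated--;
                }
        }
        return allocated;
}

Dropping a bad page after the unlock mirrors rmqueue_buddy(), which
already performs its check outside the zone lock.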
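The numbers below were measured in-kernel. As a rough userspace
approximation only (my illustration, not the benchmark behind the
tables; a page fault round trip includes much more than the zone lock
path), one can time parallel first-touch faults of fresh anonymous
memory, which go through the buddy/pcp allocation path:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define ALLOC_SIZE (2UL << 20)  /* one 2 MiB huge page worth of memory */
#define ITERATIONS 1000
#define MAX_THREADS 64

static void *worker(void *arg)
{
        long long *avg_ns = arg;
        long long total = 0;
        struct timespec t0, t1;

        for (int i = 0; i < ITERATIONS; i++) {
                void *p = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        exit(1);
                }
                madvise(p, ALLOC_SIZE, MADV_HUGEPAGE); /* best-effort THP */
                clock_gettime(CLOCK_MONOTONIC, &t0);
                memset(p, 1, ALLOC_SIZE);  /* first touch: allocates pages */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                munmap(p, ALLOC_SIZE);     /* free again for the next round */
                total += (t1.tv_sec - t0.tv_sec) * 1000000000LL +
                         (t1.tv_nsec - t0.tv_nsec);
        }
        *avg_ns = total / ITERATIONS;
        return NULL;
}

int main(int argc, char **argv)
{
        int nthreads = argc > 1 ? atoi(argv[1]) : 1;
        pthread_t tid[MAX_THREADS];
        long long avg[MAX_THREADS];

        if (nthreads < 1 || nthreads > MAX_THREADS)
                return 1;
        for (int i = 0; i < nthreads; i++)
                pthread_create(&tid[i], NULL, worker, &avg[i]);
        for (int i = 0; i < nthreads; i++) {
                pthread_join(tid[i], NULL);
                printf("thread %d: %lld ns per 2 MiB fault\n", i, avg[i]);
        }
        return 0;
}

Compile with "gcc -O2 -pthread", pass the thread count as the first
argument; transparent hugepages must be enabled for the 2 MiB case, and
dropping the madvise() call approximates the normal page case.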
The following tables show the average allocation latencies for huge
pages and normal pages for three configurations: the unpatched kernel,
the patched kernel, and an additional version without the check.

Huge pages
+-------+---------+-------+----------+---------+----------+
| Cores | Default | Patch |    Patch | NoCheck |  NoCheck |
|       |    (ns) |  (ns) |     Diff |    (ns) |     Diff |
+-------+---------+-------+----------+---------+----------+
|     1 |     127 |   124 |  (-2.4%) |     118 |  (-7.1%) |
|     2 |     140 |   140 |  (-0.0%) |     134 |  (-4.3%) |
|     3 |     143 |   142 |  (-0.7%) |     134 |  (-6.3%) |
|     4 |     178 |   159 | (-10.7%) |     156 | (-12.4%) |
|     6 |     269 |   239 | (-11.2%) |     237 | (-11.9%) |
|     8 |     363 |   321 | (-11.6%) |     319 | (-12.1%) |
|    10 |     454 |   409 |  (-9.9%) |     409 |  (-9.9%) |
|    12 |     545 |   494 |  (-9.4%) |     488 | (-10.5%) |
|    14 |     639 |   578 |  (-9.5%) |     574 | (-10.2%) |
|    16 |     735 |   660 | (-10.2%) |     653 | (-11.2%) |
|    20 |     915 |   826 |  (-9.7%) |     815 | (-10.9%) |
|    24 |    1105 |   992 | (-10.2%) |     982 | (-11.1%) |
+-------+---------+-------+----------+---------+----------+

Normal pages
+-------+---------+-------+----------+---------+----------+
| Cores | Default | Patch |    Patch | NoCheck |  NoCheck |
|       |    (ns) |  (ns) |     Diff |    (ns) |     Diff |
+-------+---------+-------+----------+---------+----------+
|     1 |    2790 |  2767 |  (-0.8%) |     171 | (-93.9%) |
|     2 |    6685 |  3484 | (-47.9%) |     519 | (-92.2%) |
|     3 |   10501 |  3599 | (-65.7%) |     855 | (-91.9%) |
|     4 |   14264 |  3635 | (-74.5%) |    1139 | (-92.0%) |
|     6 |   21800 |  3551 | (-83.7%) |    1713 | (-92.1%) |
|     8 |   29563 |  3570 | (-87.9%) |    2268 | (-92.3%) |
|    10 |   37210 |  3845 | (-89.7%) |    2872 | (-92.3%) |
|    12 |   44780 |  4452 | (-90.1%) |    3417 | (-92.4%) |
|    14 |   52576 |  5100 | (-90.3%) |    4020 | (-92.4%) |
|    16 |   60118 |  5785 | (-90.4%) |    4604 | (-92.3%) |
|    20 |   75037 |  7270 | (-90.3%) |    6486 | (-91.4%) |
|    24 |   90226 |  8712 | (-90.3%) |    7061 | (-92.2%) |
+-------+---------+-------+----------+---------+----------+

> Vlastimil, I find this quite confusing:
>
> #ifdef CONFIG_DEBUG_VM
> /*
>  * With DEBUG_VM enabled, order-0 pages are checked for expected state
>  * when being allocated from pcp lists. With debug_pagealloc also
>  * enabled, they are also checked when pcp lists are refilled from the
>  * free lists.
>  */
> static inline bool check_pcp_refill(struct page *page, unsigned int order)
> {
>         if (debug_pagealloc_enabled_static())
>                 return check_new_pages(page, order);
>         else
>                 return false;
> }
>
> static inline bool check_new_pcp(struct page *page, unsigned int order)
> {
>         return check_new_pages(page, order);
> }
> #else
> /*
>  * With DEBUG_VM disabled, free order-0 pages are checked for expected
>  * state when pcp lists are being refilled from the free lists. With
>  * debug_pagealloc enabled, they are also checked when being allocated
>  * from the pcp lists.
>  */
> static inline bool check_pcp_refill(struct page *page, unsigned int order)
> {
>         return check_new_pages(page, order);
> }
> static inline bool check_new_pcp(struct page *page, unsigned int order)
> {
>         if (debug_pagealloc_enabled_static())
>                 return check_new_pages(page, order);
>         else
>                 return false;
> }
> #endif /* CONFIG_DEBUG_VM */
>
> and the 4462b32c9285b5 changelog is a struggle to follow.
>
> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
> debug_pagealloc_enabled is false?
>
> Anyway, these checks sound quite costly so let's revisit their
> desirability?
>
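For what it's worth, my reading of the four cases above (a summary of
the code as quoted, not a definitive statement):

                                  | refill check | pcp alloc check
 DEBUG_VM=y, debug_pagealloc on   |     yes      |      yes
 DEBUG_VM=y, debug_pagealloc off  |     no       |      yes
 DEBUG_VM=n, debug_pagealloc on   |     yes      |      yes
 DEBUG_VM=n, debug_pagealloc off  |     yes      |      no

So each page is checked at least once on its way through the pcp lists
in every configuration. With CONFIG_DEBUG_VM=n and debug_pagealloc off,
that single check happens at refill time, which is exactly the one this
patch moves out of the critical section.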