From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
In-Reply-To: (Yafang Shao's message of "Mon, 29 Jul 2024 14:13:07 +0800")
References: <20240729023532.1555-1-laoar.shao@gmail.com> <20240729023532.1555-4-laoar.shao@gmail.com> <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com> <874j88ye65.fsf@yhuang6-desk2.ccr.corp.intel.com> <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com> <87v80owxd8.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 29 Jul 2024 14:14:54 +0800
Message-ID: <87r0bcwwpt.fsf@yhuang6-desk2.ccr.corp.intel.com>
Yafang Shao writes:

> On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Hi, Yafang,
>> >> >> >>
>> >> >> >> Yafang Shao writes:
>> >> >> >>
>> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> >> >> >> > in practice.
>> >> >> >>
>> >> >> >> As we discussed before [1], I still find the description of zone->lock
>> >> >> >> contention confusing. How about changing the description to something
>> >> >> >> like,
>> >> >> >
>> >> >> > Sure, I will change it.
>> >> >> >
>> >> >> >>
>> >> >> >> A larger page allocation/freeing batch number may cause a longer run time of
>> >> >> >> the code holding zone->lock. If zone->lock is heavily contended at the same
>> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
>> >> >> >> Although reducing the batch number cannot make zone->lock contention
>> >> >> >> lighter, it can reduce the latency spikes effectively.
>> >> >> >>
>> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >> >> >>
>> >> >> >> > To demonstrate this, I wrote a Python script:
>> >> >> >> >
>> >> >> >> >     import mmap
>> >> >> >> >
>> >> >> >> >     size = 6 * 1024**3
>> >> >> >> >
>> >> >> >> >     while True:
>> >> >> >> >         mm = mmap.mmap(-1, size)
>> >> >> >> >         mm[:] = b'\xff' * size
>> >> >> >> >         mm.close()
>> >> >> >> >
>> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
>> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tool
>> >> >> >> > funclatency[1]:
>> >> >> >> >
>> >> >> >> >     funclatency -T -i 600 rmqueue_bulk
>> >> >> >> >
>> >> >> >> > Here are the results for both AMD and Intel CPUs.
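[The quoted setup (10 copies of the script running in parallel) can be sketched as one self-contained driver. This is an illustrative sketch, not part of the patch: `size` is scaled down and the loop made finite so the sketch terminates, unlike the original 6 GiB infinite loop.]

```python
# Illustrative driver for the stress test quoted above (not from the patch):
# each worker repeatedly maps, dirties, and unmaps an anonymous region,
# exercising bulk page allocation/freeing. size/iterations are scaled down
# here so the sketch finishes quickly; the original used 6 GiB forever.
import mmap
from multiprocessing import Process


def stress(size: int, iterations: int) -> None:
    for _ in range(iterations):
        mm = mmap.mmap(-1, size)   # anonymous mapping
        mm[:] = b'\xff' * size     # touch every page
        mm.close()                 # free the pages back


if __name__ == "__main__":
    workers = [Process(target=stress, args=(1024 * 1024, 3)) for _ in range(10)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

[With the workers running, the measurement step is the one quoted above: `funclatency -T -i 600 rmqueue_bulk` from BCC.]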
>> >> >> >> >
>> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> >> >> > ======================================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 12       |                                        |
>> >> >> >> >       1024 -> 2047       : 9116     |                                        |
>> >> >> >> >       2048 -> 4095       : 2004     |                                        |
>> >> >> >> >       4096 -> 8191       : 2497     |                                        |
>> >> >> >> >       8192 -> 16383      : 2127     |                                        |
>> >> >> >> >      16384 -> 32767      : 2483     |                                        |
>> >> >> >> >      32768 -> 65535      : 10102    |                                        |
>> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
>> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
>> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
>> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
>> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
>> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
>> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
>> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
>> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
>> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >> >> >
>> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
>> >> >> >> > than 30ms.
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 92       |                                        |
>> >> >> >> >       1024 -> 2047       : 8594     |                                        |
>> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
>> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
>> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
>> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
>> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
>> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
>> >> >> >> >     131072 -> 262143     : 204004   |                                        |
>> >> >> >> >     262144 -> 524287     : 78246    |                                        |
>> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
>> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
>> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
>> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
>> >> >> >> >
>> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >> >> >
>> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
>> >> >> >> > to less than 8ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
>> >> >> >> >
>> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> >> >> >> > test it on different AMD models.
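[For readers following the numbers: as I understand the kernel's PCP batching (an assumption here, not something stated in this mail), the scale value caps a power-of-two multiplier on the per-CPU batch, so the page count moved under one zone->lock hold grows roughly as `batch << scale`. A toy sketch, with a base batch of 63 chosen purely for illustration:]

```python
# Toy model (editor's sketch, values assumed): how a shift-based scale cap
# bounds the pages moved per zone->lock acquisition. A larger cap means more
# work per lock hold, hence longer worst-case hold times under contention.
def max_pages_per_lock_hold(base_batch: int, scale_max: int) -> int:
    return base_batch << scale_max


for scale in (0, 3, 5):
    # scale 0 -> 63 pages, scale 3 -> 504, scale 5 -> 2016
    print(scale, max_pages_per_lock_hold(63, scale))
```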
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> >> >> > ============================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 2419     |                                        |
>> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
>> >> >> >> >       2048 -> 4095       : 4272     |                                        |
>> >> >> >> >       4096 -> 8191       : 9035     |                                        |
>> >> >> >> >       8192 -> 16383      : 4374     |                                        |
>> >> >> >> >      16384 -> 32767      : 2963     |                                        |
>> >> >> >> >      32768 -> 65535      : 6407     |                                        |
>> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
>> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
>> >> >> >> >     262144 -> 524287     : 13406    |                                        |
>> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
>> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
>> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
>> >> >> >> >
>> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > This Intel CPU works fine with the default setting.
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> >> >> > ==============================================================
>> >> >> >> >
>> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> >> >> >> > node 0 only.
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 46       |                                        |
>> >> >> >> >        512 -> 1023       : 695      |                                        |
>> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
>> >> >> >> >       2048 -> 4095       : 1788     |                                        |
>> >> >> >> >       4096 -> 8191       : 3392     |                                        |
>> >> >> >> >       8192 -> 16383      : 2569     |                                        |
>> >> >> >> >      16384 -> 32767      : 2619     |                                        |
>> >> >> >> >      32768 -> 65535      : 3809     |                                        |
>> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
>> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
>> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
>> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
>> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
>> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
>> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >> >> >
>> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
>> >> >> >> > node. The average latency is approximately 144us, with the maximum
>> >> >> >> > latency exceeding 4ms.
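[As a sanity check on the quoted figures, the mean can be re-estimated from the histogram itself by weighting bucket midpoints. This is an editor's sketch using only the non-zero rows above; midpoints bias the estimate somewhat high, so it lands within roughly 20% of the reported avg.]

```python
# Rough cross-check (editor's sketch, not part of the patch): estimate the
# mean latency from the quoted log2 histogram (single NUMA node, default 5)
# by weighting each bucket's midpoint with its count.
buckets = {  # (low_ns, high_ns): count, non-zero rows from the table above
    (256, 511): 46, (512, 1023): 695, (1024, 2047): 19950,
    (2048, 4095): 1788, (4096, 8191): 3392, (8192, 16383): 2569,
    (16384, 32767): 2619, (32768, 65535): 3809, (65536, 131071): 616182,
    (131072, 262143): 295587, (262144, 524287): 75357,
    (524288, 1048575): 15471, (1048576, 2097151): 2939,
    (2097152, 4194303): 243, (4194304, 8388607): 3,
}

total = sum(buckets.values())
mean = sum((lo + hi) / 2 * n for (lo, hi), n in buckets.items()) / total
print(round(mean))  # ballpark of the reported avg = 144410 nsecs
```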
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 24       |                                        |
>> >> >> >> >        512 -> 1023       : 2686     |                                        |
>> >> >> >> >       1024 -> 2047       : 10246    |                                        |
>> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
>> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
>> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
>> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
>> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
>> >> >> >> >      65536 -> 131071     : 110817   |                                        |
>> >> >> >> >     131072 -> 262143     : 20279    |                                        |
>> >> >> >> >     262144 -> 524287     : 4176     |                                        |
>> >> >> >> >     524288 -> 1048575    : 436      |                                        |
>> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
>> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >> >> >
>> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> >> >> >> > max latency is less than 4ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> >> >> >> > applications work well with the default setting.
>> >> >> >> >
>> >> >> >> > It is worth noting that all the above data were tested using the upstream
>> >> >> >> > kernel.
>> >> >> >> >
>> >> >> >> > Why introduce a sysctl knob?
>> >> >> >> > ============================
>> >> >> >> >
>> >> >> >> > From the above data, it's clear that different CPU types have varying
>> >> >> >> > allocation latencies concerning zone->lock contention.
>> >> >> >> > Typically, people don't release individual kernel packages for each type
>> >> >> >> > of x86_64 CPU.
>> >> >> >> >
>> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> >> >> > setting for better throughput. In our production environment, we set this
>> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
>> >> >> >> > at the default value of 5 for other applications like big data. It's not
>> >> >> >> > common to release individual kernel packages for each application.
>> >> >> >>
>> >> >> >> Thanks for the detailed performance data!
>> >> >> >>
>> >> >> >> Have you observed any downside to setting CONFIG_PCP_BATCH_SCALE_MAX to 0
>> >> >> >> in your environment? If not, I suggest using 0 as the default for
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
>> >> >> >> that, if someone finds that some other workloads need a larger
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it dynamically tunable.
>> >> >> >
>> >> >> > The decision doesn't rest with us, the kernel team at our company.
>> >> >> > It's made by the system administrators who manage a large number of
>> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> >> >> > servers, not in other environments like big data servers. We have
>> >> >> > informed other system administrators, such as those managing the big
>> >> >> > data servers, about the latency spike issues, but they are unwilling
>> >> >> > to make the change.
>> >> >> >
>> >> >> > No one wants to make changes unless there is evidence showing that the
>> >> >> > old settings will negatively impact them. However, as you know,
>> >> >> > latency is not a critical concern for big data; throughput matters more.
>> >> >> > If we keep the current settings, we will have to release
>> >> >> > different kernel packages for different environments, which is a
>> >> >> > significant burden for us.
>> >> >>
>> >> >> I totally understand your requirements. And I think this is better
>> >> >> resolved in your downstream kernel. If there is clear evidence that a
>> >> >> small batch number hurts throughput for some workloads, we can
>> >> >> make the change in the upstream kernel.
>> >> >>
>> >> >
>> >> > Please don't make this more complicated. We are at an impasse.
>> >> >
>> >> > The key issue here is that the upstream kernel has a default value of
>> >> > 5, not 0. If you can change it to 0, we can persuade our users to
>> >> > follow the upstream changes. They currently set it to 5, not because
>> >> > you, the author, chose this value, but because it is the default in
>> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
>> >> > support it. It's not just your decision as the author; the entire
>> >> > community supports this default.
>> >> >
>> >> > If, in the future, we find that the value of 0 is not suitable, you'll
>> >> > tell us, "It is an issue in your downstream kernel, not in the
>> >> > upstream kernel, so we won't accept it." PANIC.
>> >>
>> >> I don't think so. I suggest you change the default value to 0. If
>> >> someone reports that their workloads need some other value, then we have
>> >> evidence that different workloads need different values. At that time,
>> >> we can suggest adding a user-tunable knob.
>> >>
>> >
>> > The problem is that others are unaware we've set it to 0, and I can't
>> > constantly monitor the linux-mm mailing list. Additionally, it's
>> > possible that you can't always keep an eye on it either.
>>
>> IIUC, they will use the default value. Then, if there is any
>> performance regression, they can report it.
>
> Now we report it. What is your reply?
> "Keep it in your downstream
> kernel." Wow, PANIC again.

This is not all of my reply. I suggested that you change the default
value too.

>
>>
>> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>>

--
Best Regards,
Huang, Ying