From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
 linux-mm@kvack.org, Matthew Wilcox, David Rientjes
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
In-Reply-To: (Yafang Shao's message of "Mon, 29 Jul 2024 13:45:44 +0800")
References: <20240729023532.1555-1-laoar.shao@gmail.com>
 <20240729023532.1555-4-laoar.shao@gmail.com>
 <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <874j88ye65.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 29 Jul 2024 13:50:36 +0800
Message-ID: <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying wrote:
>> >>
>> >> Hi, Yafang,
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult
>> >> > to use in practice.
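>> >> >
>> >> > To make the scale factor concrete: it bounds how far the PCP batch
>> >> > size may be shifted left, and therefore roughly how many pages can
>> >> > be moved under a single zone->lock hold. A rough sketch (assuming a
>> >> > typical per-CPU batch of 63; the real batch is computed at boot and
>> >> > varies with zone size):
>> >> >
>> >> >     batch=63
>> >> >     for scale in 0 1 2 3 4 5; do
>> >> >         # up to roughly batch << scale pages per zone->lock hold
>> >> >         echo "scale=$scale -> $((batch << scale)) pages"
>> >> >     done
>> >> >
>> >> > With the default of 5, up to ~2016 pages may be processed per lock
>> >> > acquisition, and changing that bound requires rebuilding the kernel.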
>> >>
>> >> As we discussed before [1], I still find the description of the
>> >> zone->lock contention confusing. How about changing the description to
>> >> something like:
>> >
>> > Sure, I will change it.
>> >
>> >>
>> >> A larger page allocation/freeing batch number may cause a longer run
>> >> time of the code holding zone->lock. If zone->lock is heavily
>> >> contended at the same time, latency spikes may occur even for casual
>> >> page allocation/freeing. Although reducing the batch number cannot
>> >> make the zone->lock contention lighter, it can reduce the latency
>> >> spikes effectively.
>> >>
>> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >>
>> >> > To demonstrate this, I wrote a Python script:
>> >> >
>> >> >     import mmap
>> >> >
>> >> >     size = 6 * 1024**3
>> >> >
>> >> >     while True:
>> >> >         mm = mmap.mmap(-1, size)
>> >> >         mm[:] = b'\xff' * size
>> >> >         mm.close()
>> >> >
>> >> > Run this script 10 times in parallel and measure the allocation
>> >> > latency by measuring the duration of rmqueue_bulk() with the BCC
>> >> > tool funclatency[1]:
>> >> >
>> >> >     funclatency -T -i 600 rmqueue_bulk
>> >> >
>> >> > Here are the results for both AMD and Intel CPUs.
>> >> >
>> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> > =====================================================================
>> >> >
>> >> > - Default value of 5
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 0        |                                        |
>> >> >        512 -> 1023       : 12       |                                        |
>> >> >       1024 -> 2047       : 9116     |                                        |
>> >> >       2048 -> 4095       : 2004     |                                        |
>> >> >       4096 -> 8191       : 2497     |                                        |
>> >> >       8192 -> 16383      : 2127     |                                        |
>> >> >      16384 -> 32767      : 2483     |                                        |
>> >> >      32768 -> 65535      : 10102    |                                        |
>> >> >      65536 -> 131071     : 212730   |*******************                     |
>> >> >     131072 -> 262143     : 314692   |*****************************           |
>> >> >     262144 -> 524287     : 430058   |****************************************|
>> >> >     524288 -> 1048575    : 224032   |********************                    |
>> >> >    1048576 -> 2097151    : 73567    |******                                  |
>> >> >    2097152 -> 4194303    : 17079    |*                                       |
>> >> >    4194304 -> 8388607    : 3900     |                                        |
>> >> >    8388608 -> 16777215   : 750      |                                        |
>> >> >   16777216 -> 33554431   : 88       |                                        |
>> >> >   33554432 -> 67108863   : 2        |                                        |
>> >> >
>> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >
>> >> > The average allocation latency can be as high as 449us, and the max
>> >> > latency can be higher than 30ms.
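>> >> >
>> >> > For reference, the measurement can be driven with something like the
>> >> > following sketch (alloc_stress.py is a stand-in name for the script
>> >> > above; funclatency needs root and the BCC tools installed):
>> >> >
>> >> >     # launch 10 copies of the reproducer
>> >> >     for i in $(seq 10); do
>> >> >         python3 alloc_stress.py &
>> >> >     done
>> >> >
>> >> >     # print a latency histogram of rmqueue_bulk() every 600 seconds
>> >> >     funclatency -T -i 600 rmqueue_bulk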
>> >> >
>> >> > - Value set to 0
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 0        |                                        |
>> >> >        512 -> 1023       : 92       |                                        |
>> >> >       1024 -> 2047       : 8594     |                                        |
>> >> >       2048 -> 4095       : 2042818  |******                                  |
>> >> >       4096 -> 8191       : 8737624  |**************************              |
>> >> >       8192 -> 16383      : 13147872 |****************************************|
>> >> >      16384 -> 32767      : 8799951  |**************************              |
>> >> >      32768 -> 65535      : 2879715  |********                                |
>> >> >      65536 -> 131071     : 659600   |**                                      |
>> >> >     131072 -> 262143     : 204004   |                                        |
>> >> >     262144 -> 524287     : 78246    |                                        |
>> >> >     524288 -> 1048575    : 30800    |                                        |
>> >> >    1048576 -> 2097151    : 12251    |                                        |
>> >> >    2097152 -> 4194303    : 2950     |                                        |
>> >> >    4194304 -> 8388607    : 78       |                                        |
>> >> >
>> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >
>> >> > The average latency was reduced significantly to 19us, and the max
>> >> > latency is reduced to less than 8ms.
>> >> >
>> >> > - Conclusion
>> >> >
>> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps
>> >> > reduce latency. Latency-sensitive applications will benefit from
>> >> > this tuning.
>> >> >
>> >> > However, I don't have access to other types of AMD CPUs, so I was
>> >> > unable to test it on different AMD models.
>> >> >
>> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> > ============================================================
>> >> >
>> >> > - Default value of 5
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 0        |                                        |
>> >> >        512 -> 1023       : 2419     |                                        |
>> >> >       1024 -> 2047       : 34499    |*                                       |
>> >> >       2048 -> 4095       : 4272     |                                        |
>> >> >       4096 -> 8191       : 9035     |                                        |
>> >> >       8192 -> 16383      : 4374     |                                        |
>> >> >      16384 -> 32767      : 2963     |                                        |
>> >> >      32768 -> 65535      : 6407     |                                        |
>> >> >      65536 -> 131071     : 884806   |****************************************|
>> >> >     131072 -> 262143     : 145931   |******                                  |
>> >> >     262144 -> 524287     : 13406    |                                        |
>> >> >     524288 -> 1048575    : 1874     |                                        |
>> >> >    1048576 -> 2097151    : 249      |                                        |
>> >> >    2097152 -> 4194303    : 28       |                                        |
>> >> >
>> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >
>> >> > - Conclusion
>> >> >
>> >> > This Intel CPU works fine with the default setting.
>> >> >
>> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> > ==============================================================
>> >> >
>> >> > Using the cpuset cgroup, we can restrict the test script to run on
>> >> > NUMA node 0 only.
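>> >> >
>> >> > A sketch of that restriction with cgroup v2 (the cgroup name is
>> >> > illustrative, and the CPU list of node 0 is machine-specific; check
>> >> > lscpu or /sys/devices/system/node/node0/cpulist):
>> >> >
>> >> >     echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>> >> >     mkdir /sys/fs/cgroup/pcp_test
>> >> >     # bind both memory and CPUs to NUMA node 0
>> >> >     echo 0 > /sys/fs/cgroup/pcp_test/cpuset.mems
>> >> >     echo 0-23 > /sys/fs/cgroup/pcp_test/cpuset.cpus
>> >> >     # move the current shell in, then start the test scripts
>> >> >     echo $$ > /sys/fs/cgroup/pcp_test/cgroup.procs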
>> >> >
>> >> > - Default value of 5
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 46       |                                        |
>> >> >        512 -> 1023       : 695      |                                        |
>> >> >       1024 -> 2047       : 19950    |*                                       |
>> >> >       2048 -> 4095       : 1788     |                                        |
>> >> >       4096 -> 8191       : 3392     |                                        |
>> >> >       8192 -> 16383      : 2569     |                                        |
>> >> >      16384 -> 32767      : 2619     |                                        |
>> >> >      32768 -> 65535      : 3809     |                                        |
>> >> >      65536 -> 131071     : 616182   |****************************************|
>> >> >     131072 -> 262143     : 295587   |*******************                     |
>> >> >     262144 -> 524287     : 75357    |****                                    |
>> >> >     524288 -> 1048575    : 15471    |*                                       |
>> >> >    1048576 -> 2097151    : 2939     |                                        |
>> >> >    2097152 -> 4194303    : 243      |                                        |
>> >> >    4194304 -> 8388607    : 3        |                                        |
>> >> >
>> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >
>> >> > The zone->lock contention becomes severe when there is only a single
>> >> > NUMA node. The average latency is approximately 144us, with the
>> >> > maximum latency exceeding 4ms.
>> >> >
>> >> > - Value set to 0
>> >> >
>> >> >      nsecs               : count     distribution
>> >> >          0 -> 1          : 0        |                                        |
>> >> >          2 -> 3          : 0        |                                        |
>> >> >          4 -> 7          : 0        |                                        |
>> >> >          8 -> 15         : 0        |                                        |
>> >> >         16 -> 31         : 0        |                                        |
>> >> >         32 -> 63         : 0        |                                        |
>> >> >         64 -> 127        : 0        |                                        |
>> >> >        128 -> 255        : 0        |                                        |
>> >> >        256 -> 511        : 24       |                                        |
>> >> >        512 -> 1023       : 2686     |                                        |
>> >> >       1024 -> 2047       : 10246    |                                        |
>> >> >       2048 -> 4095       : 4061529  |*********                               |
>> >> >       4096 -> 8191       : 16894971 |****************************************|
>> >> >       8192 -> 16383      : 6279310  |**************                          |
>> >> >      16384 -> 32767      : 1658240  |***                                     |
>> >> >      32768 -> 65535      : 445760   |*                                       |
>> >> >      65536 -> 131071     : 110817   |                                        |
>> >> >     131072 -> 262143     : 20279    |                                        |
>> >> >     262144 -> 524287     : 4176     |                                        |
>> >> >     524288 -> 1048575    : 436      |                                        |
>> >> >    1048576 -> 2097151    : 8        |                                        |
>> >> >    2097152 -> 4194303    : 2        |                                        |
>> >> >
>> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >
>> >> > After setting it to 0, the average latency is reduced to around 8us,
>> >> > and the max latency is less than 4ms.
>> >> >
>> >> > - Conclusion
>> >> >
>> >> > On this Intel CPU, this tuning doesn't help much: the maximum
>> >> > latency stays around 4ms with either setting. Latency-sensitive
>> >> > applications work well with the default setting.
>> >> >
>> >> > It is worth noting that all the above data were collected on the
>> >> > upstream kernel.
>> >> >
>> >> > Why introduce a sysctl knob?
>> >> > ============================
>> >> >
>> >> > From the above data, it's clear that different CPU types have varying
>> >> > allocation latencies with respect to zone->lock contention.
>> >> > Typically, people don't release individual kernel packages for each
>> >> > type of x86_64 CPU.
>> >> >
>> >> > Furthermore, for latency-insensitive applications, we can keep the
>> >> > default setting for better throughput. In our production environment,
>> >> > we set this value to 0 for applications running on Kubernetes servers
>> >> > while keeping it at the default value of 5 for other applications
>> >> > like big data. It's not common to release individual kernel packages
>> >> > for each application.
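>> >> >
>> >> > With the proposed knob, that per-workload split becomes a runtime
>> >> > setting instead of a rebuild, e.g. (using the values from our
>> >> > production environment):
>> >> >
>> >> >     # latency-sensitive Kubernetes hosts
>> >> >     sysctl -w vm.pcp_batch_scale_max=0
>> >> >
>> >> >     # throughput-oriented hosts (e.g. big data) keep the default
>> >> >     sysctl -w vm.pcp_batch_scale_max=5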
>> >>
>> >> Thanks for the detailed performance data!
>> >>
>> >> Is there any downside observed when setting CONFIG_PCP_BATCH_SCALE_MAX
>> >> to 0 in your environment? If not, I suggest using 0 as the default for
>> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that a
>> >> large CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.
>> >> After that, if someone finds that some other workloads need a larger
>> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it dynamically tunable.
>> >>
>> >
>> > The decision doesn't rest with us, the kernel team at our company.
>> > It's made by the system administrators who manage a large number of
>> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> > servers, not in other environments like big data servers. We have
>> > informed other system administrators, such as those managing the big
>> > data servers, about the latency spike issues, but they are unwilling
>> > to make the change.
>> >
>> > No one wants to make changes unless there is evidence showing that the
>> > old settings will negatively impact them. However, as you know,
>> > latency is not a critical concern for big data; throughput is more
>> > important. If we keep the current settings, we will have to release
>> > different kernel packages for different environments, which is a
>> > significant burden for us.
>>
>> I totally understand your requirements, and I think this is better
>> resolved in your downstream kernel. If there is clear evidence that a
>> small batch number hurts throughput for some workloads, we can make the
>> change in the upstream kernel.
>>
>
> Please don't make this more complicated. We are at an impasse.
>
> The key issue here is that the upstream kernel has a default value of
> 5, not 0. If you can change it to 0, we can persuade our users to
> follow the upstream changes. They currently set it to 5, not because
> you, the author, chose this value, but because it is the default in
> Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> support it. It's not just your decision as the author; the entire
> community stands behind this default.
>
> If, in the future, we find that the value of 0 is not suitable, you'll
> tell us, "It is an issue in your downstream kernel, not in the
> upstream kernel, so we won't accept it." PANIC.

I don't think so. I suggest you change the default value to 0. If
someone then reports that their workloads need some other value, we
will have evidence that different workloads need different values. At
that point, we can propose adding a user-tunable knob.

--
Best Regards,
Huang, Ying