From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
In-Reply-To: (Yafang Shao's message of "Mon, 29 Jul 2024 14:13:07 +0800")
References: <20240729023532.1555-1-laoar.shao@gmail.com> <20240729023532.1555-4-laoar.shao@gmail.com> <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com> <874j88ye65.fsf@yhuang6-desk2.ccr.corp.intel.com> <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com> <87v80owxd8.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 29 Jul 2024 14:14:54 +0800
Message-ID: <87r0bcwwpt.fsf@yhuang6-desk2.ccr.corp.intel.com>
Yafang Shao writes:

> On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Hi, Yafang,
>> >> >> >>
>> >> >> >> Yafang Shao writes:
>> >> >> >>
>> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> >> >> >> > in practice.
>> >> >> >>
>> >> >> >> As we discussed before [1], I still find the description of zone->lock
>> >> >> >> contention confusing. How about changing the description to something
>> >> >> >> like,
>> >> >> >
>> >> >> > Sure, I will change it.
>> >> >> >
>> >> >> >>
>> >> >> >> A larger page allocation/freeing batch number may cause a longer run time of
>> >> >> >> the code holding zone->lock. If zone->lock is heavily contended at the same
>> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
>> >> >> >> Although reducing the batch number cannot make zone->lock contention
>> >> >> >> lighter, it can reduce the latency spikes effectively.
>> >> >> >>
>> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >> >> >>
>> >> >> >> > To demonstrate this, I wrote a Python script:
>> >> >> >> >
>> >> >> >> >     import mmap
>> >> >> >> >
>> >> >> >> >     size = 6 * 1024**3
>> >> >> >> >
>> >> >> >> >     while True:
>> >> >> >> >         mm = mmap.mmap(-1, size)
>> >> >> >> >         mm[:] = b'\xff' * size
>> >> >> >> >         mm.close()
>> >> >> >> >
>> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
>> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tool
>> >> >> >> > funclatency[1]:
>> >> >> >> >
>> >> >> >> >     funclatency -T -i 600 rmqueue_bulk
>> >> >> >> >
>> >> >> >> > Here are the results for both AMD and Intel CPUs.
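[The quoted setup (10 copies of the script running in parallel) can be sketched as one self-contained driver. This is an illustrative sketch, not part of the patch: `size` is scaled down and the loop made finite so the sketch terminates, unlike the original 6 GiB infinite loop.]

```python
# Illustrative driver for the stress test quoted above (not from the patch):
# each worker repeatedly maps, dirties, and unmaps an anonymous region,
# exercising bulk page allocation/freeing. size/iterations are scaled down
# here so the sketch finishes quickly; the original used 6 GiB forever.
import mmap
from multiprocessing import Process


def stress(size: int, iterations: int) -> None:
    for _ in range(iterations):
        mm = mmap.mmap(-1, size)   # anonymous mapping
        mm[:] = b'\xff' * size     # touch every page
        mm.close()                 # free the pages back


if __name__ == "__main__":
    workers = [Process(target=stress, args=(1024 * 1024, 3)) for _ in range(10)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

[With the workers running, the measurement step is the one quoted above: `funclatency -T -i 600 rmqueue_bulk` from BCC.]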
>> >> >> >> >
>> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> >> >> > ======================================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 12       |                                        |
>> >> >> >> >       1024 -> 2047       : 9116     |                                        |
>> >> >> >> >       2048 -> 4095       : 2004     |                                        |
>> >> >> >> >       4096 -> 8191       : 2497     |                                        |
>> >> >> >> >       8192 -> 16383      : 2127     |                                        |
>> >> >> >> >      16384 -> 32767      : 2483     |                                        |
>> >> >> >> >      32768 -> 65535      : 10102    |                                        |
>> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
>> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
>> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
>> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
>> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
>> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
>> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
>> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
>> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
>> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >> >> >
>> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
>> >> >> >> > than 30ms.
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 92       |                                        |
>> >> >> >> >       1024 -> 2047       : 8594     |                                        |
>> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
>> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
>> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
>> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
>> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
>> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
>> >> >> >> >     131072 -> 262143     : 204004   |                                        |
>> >> >> >> >     262144 -> 524287     : 78246    |                                        |
>> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
>> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
>> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
>> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
>> >> >> >> >
>> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >> >> >
>> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
>> >> >> >> > to less than 8ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
>> >> >> >> >
>> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> >> >> >> > test it on different AMD models.
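[For readers following the numbers: as I understand the kernel's PCP batching (an assumption here, not something stated in this mail), the scale value caps a power-of-two multiplier on the per-CPU batch, so the page count moved under one zone->lock hold grows roughly as `batch << scale`. A toy sketch, with a base batch of 63 chosen purely for illustration:]

```python
# Toy model (editor's sketch, values assumed): how a shift-based scale cap
# bounds the pages moved per zone->lock acquisition. A larger cap means more
# work per lock hold, hence longer worst-case hold times under contention.
def max_pages_per_lock_hold(base_batch: int, scale_max: int) -> int:
    return base_batch << scale_max


for scale in (0, 3, 5):
    # scale 0 -> 63 pages, scale 3 -> 504, scale 5 -> 2016
    print(scale, max_pages_per_lock_hold(63, scale))
```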
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> >> >> > ============================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 2419     |                                        |
>> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
>> >> >> >> >       2048 -> 4095       : 4272     |                                        |
>> >> >> >> >       4096 -> 8191       : 9035     |                                        |
>> >> >> >> >       8192 -> 16383      : 4374     |                                        |
>> >> >> >> >      16384 -> 32767      : 2963     |                                        |
>> >> >> >> >      32768 -> 65535      : 6407     |                                        |
>> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
>> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
>> >> >> >> >     262144 -> 524287     : 13406    |                                        |
>> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
>> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
>> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
>> >> >> >> >
>> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > This Intel CPU works fine with the default setting.
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> >> >> > ==============================================================
>> >> >> >> >
>> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> >> >> >> > node 0 only.
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 46       |                                        |
>> >> >> >> >        512 -> 1023       : 695      |                                        |
>> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
>> >> >> >> >       2048 -> 4095       : 1788     |                                        |
>> >> >> >> >       4096 -> 8191       : 3392     |                                        |
>> >> >> >> >       8192 -> 16383      : 2569     |                                        |
>> >> >> >> >      16384 -> 32767      : 2619     |                                        |
>> >> >> >> >      32768 -> 65535      : 3809     |                                        |
>> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
>> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
>> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
>> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
>> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
>> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
>> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >> >> >
>> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
>> >> >> >> > node. The average latency is approximately 144us, with the maximum
>> >> >> >> > latency exceeding 4ms.
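[As a sanity check on the quoted figures, the mean can be re-estimated from the histogram itself by weighting bucket midpoints. This is an editor's sketch using only the non-zero rows above; midpoints bias the estimate somewhat high, so it lands within roughly 20% of the reported avg.]

```python
# Rough cross-check (editor's sketch, not part of the patch): estimate the
# mean latency from the quoted log2 histogram (single NUMA node, default 5)
# by weighting each bucket's midpoint with its count.
buckets = {  # (low_ns, high_ns): count, non-zero rows from the table above
    (256, 511): 46, (512, 1023): 695, (1024, 2047): 19950,
    (2048, 4095): 1788, (4096, 8191): 3392, (8192, 16383): 2569,
    (16384, 32767): 2619, (32768, 65535): 3809, (65536, 131071): 616182,
    (131072, 262143): 295587, (262144, 524287): 75357,
    (524288, 1048575): 15471, (1048576, 2097151): 2939,
    (2097152, 4194303): 243, (4194304, 8388607): 3,
}

total = sum(buckets.values())
mean = sum((lo + hi) / 2 * n for (lo, hi), n in buckets.items()) / total
print(round(mean))  # ballpark of the reported avg = 144410 nsecs
```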
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 24       |                                        |
>> >> >> >> >        512 -> 1023       : 2686     |                                        |
>> >> >> >> >       1024 -> 2047       : 10246    |                                        |
>> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
>> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
>> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
>> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
>> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
>> >> >> >> >      65536 -> 131071     : 110817   |                                        |
>> >> >> >> >     131072 -> 262143     : 20279    |                                        |
>> >> >> >> >     262144 -> 524287     : 4176     |                                        |
>> >> >> >> >     524288 -> 1048575    : 436      |                                        |
>> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
>> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >> >> >
>> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> >> >> >> > max latency is less than 4ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> >> >> >> > applications work well with the default setting.
>> >> >> >> >
>> >> >> >> > It is worth noting that all the above data were tested using the upstream
>> >> >> >> > kernel.
>> >> >> >> >
>> >> >> >> > Why introduce a sysctl knob?
>> >> >> >> > ============================
>> >> >> >> >
>> >> >> >> > From the above data, it's clear that different CPU types have varying
>> >> >> >> > allocation latencies concerning zone->lock contention.
>> >> >> >> > Typically, people don't release individual kernel packages for each type
>> >> >> >> > of x86_64 CPU.
>> >> >> >> >
>> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> >> >> > setting for better throughput. In our production environment, we set this
>> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
>> >> >> >> > at the default value of 5 for other applications like big data. It's not
>> >> >> >> > common to release individual kernel packages for each application.
>> >> >> >>
>> >> >> >> Thanks for the detailed performance data!
>> >> >> >>
>> >> >> >> Have you observed any downside to setting CONFIG_PCP_BATCH_SCALE_MAX to 0
>> >> >> >> in your environment? If not, I suggest using 0 as the default for
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
>> >> >> >> that, if someone finds that some other workloads need a larger
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it dynamically tunable.
>> >> >> >
>> >> >> > The decision doesn't rest with us, the kernel team at our company.
>> >> >> > It's made by the system administrators who manage a large number of
>> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> >> >> > servers, not in other environments like big data servers. We have
>> >> >> > informed other system administrators, such as those managing the big
>> >> >> > data servers, about the latency spike issues, but they are unwilling
>> >> >> > to make the change.
>> >> >> >
>> >> >> > No one wants to make changes unless there is evidence showing that the
>> >> >> > old settings will negatively impact them. However, as you know,
>> >> >> > latency is not a critical concern for big data; throughput matters more.
>> >> >> > If we keep the current settings, we will have to release
>> >> >> > different kernel packages for different environments, which is a
>> >> >> > significant burden for us.
>> >> >>
>> >> >> I totally understand your requirements. And I think this is better
>> >> >> resolved in your downstream kernel. If there is clear evidence that a
>> >> >> small batch number hurts throughput for some workloads, we can
>> >> >> make the change in the upstream kernel.
>> >> >>
>> >> >
>> >> > Please don't make this more complicated. We are at an impasse.
>> >> >
>> >> > The key issue here is that the upstream kernel has a default value of
>> >> > 5, not 0. If you can change it to 0, we can persuade our users to
>> >> > follow the upstream changes. They currently set it to 5, not because
>> >> > you, the author, chose this value, but because it is the default in
>> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
>> >> > support it. It's not just your decision as the author; the entire
>> >> > community supports this default.
>> >> >
>> >> > If, in the future, we find that the value of 0 is not suitable, you'll
>> >> > tell us, "It is an issue in your downstream kernel, not in the
>> >> > upstream kernel, so we won't accept it." PANIC.
>> >>
>> >> I don't think so. I suggest you change the default value to 0. If
>> >> someone reports that their workloads need some other value, then we have
>> >> evidence that different workloads need different values. At that time,
>> >> we can suggest adding a user-tunable knob.
>> >>
>> >
>> > The problem is that others are unaware we've set it to 0, and I can't
>> > constantly monitor the linux-mm mailing list. Additionally, it's
>> > possible that you can't always keep an eye on it either.
>>
>> IIUC, they will use the default value. Then, if there is any
>> performance regression, they can report it.
>
> Now we report it. What is your reply?
> "Keep it in your downstream
> kernel." Wow, PANIC again.

This is not all of my reply. I suggested that you change the default
value too.

>
>>
>> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>>

--
Best Regards,
Huang, Ying