From: Yafang Shao
Date: Mon, 29 Jul 2024 14:00:04 +0800
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes
In-Reply-To: <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240729023532.1555-1-laoar.shao@gmail.com> <20240729023532.1555-4-laoar.shao@gmail.com> <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com> <874j88ye65.fsf@yhuang6-desk2.ccr.corp.intel.com> <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying wrote:
> >> >>
> >> >> Hi, Yafang,
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> > in practice.
> >> >>
> >> >> As we discussed before [1], I am still confused by the description of
> >> >> zone->lock contention.  How about changing the description to
> >> >> something like:
> >> >
> >> > Sure, I will change it.
> >> >
> >> >>
> >> >> A larger page allocation/freeing batch number may cause code holding
> >> >> zone->lock to run longer.  If zone->lock is heavily contended at the same
> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> Although reducing the batch number cannot make zone->lock contention
> >> >> lighter, it can reduce the latency spikes effectively.
> >> >>
> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >> >>
> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >
> >> >> >   import mmap
> >> >> >
> >> >> >   size = 6 * 1024**3
> >> >> >
> >> >> >   while True:
> >> >> >       mm = mmap.mmap(-1, size)
> >> >> >       mm[:] = b'\xff' * size
> >> >> >       mm.close()
> >> >> >
> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> >> > funclatency[1]:
> >> >> >
> >> >> >   funclatency -T -i 600 rmqueue_bulk
> >> >> >
> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >
> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> > =====================================================================
> >> >> >
> >> >> > - Default value of 5
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >        512 -> 1023       : 12       |                                        |
> >> >> >       1024 -> 2047       : 9116     |                                        |
> >> >> >       2048 -> 4095       : 2004     |                                        |
> >> >> >       4096 -> 8191       : 2497     |                                        |
> >> >> >       8192 -> 16383      : 2127     |                                        |
> >> >> >      16384 -> 32767      : 2483     |                                        |
> >> >> >      32768 -> 65535      : 10102    |                                        |
> >> >> >      65536 -> 131071     : 212730   |*******************                     |
> >> >> >     131072 -> 262143     : 314692   |*****************************           |
> >> >> >     262144 -> 524287     : 430058   |****************************************|
> >> >> >     524288 -> 1048575    : 224032   |********************                    |
> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
> >> >> >    4194304 -> 8388607    : 3900     |                                        |
> >> >> >    8388608 -> 16777215   : 750      |                                        |
> >> >> >   16777216 -> 33554431   : 88       |                                        |
> >> >> >   33554432 -> 67108863   : 2        |                                        |
> >> >> >
> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >
> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> > than 30ms.
> >> >> >
> >> >> > - Value set to 0
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >        512 -> 1023       : 92       |                                        |
> >> >> >       1024 -> 2047       : 8594     |                                        |
> >> >> >       2048 -> 4095       : 2042818  |******                                  |
> >> >> >       4096 -> 8191       : 8737624  |**************************              |
> >> >> >       8192 -> 16383      : 13147872 |****************************************|
> >> >> >      16384 -> 32767      : 8799951  |**************************              |
> >> >> >      32768 -> 65535      : 2879715  |********                                |
> >> >> >      65536 -> 131071     : 659600   |**                                      |
> >> >> >     131072 -> 262143     : 204004   |                                        |
> >> >> >     262144 -> 524287     : 78246    |                                        |
> >> >> >     524288 -> 1048575    : 30800    |                                        |
> >> >> >    1048576 -> 2097151    : 12251    |                                        |
> >> >> >    2097152 -> 4194303    : 2950     |                                        |
> >> >> >    4194304 -> 8388607    : 78       |                                        |
> >> >> >
> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >
> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> > to less than 8ms.
> >> >> >
> >> >> > - Conclusion
> >> >> >
> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >
> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> > test it on different AMD models.
> >> >> >
> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> > ============================================================
> >> >> >
> >> >> > - Default value of 5
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >        512 -> 1023       : 2419     |                                        |
> >> >> >       1024 -> 2047       : 34499    |*                                       |
> >> >> >       2048 -> 4095       : 4272     |                                        |
> >> >> >       4096 -> 8191       : 9035     |                                        |
> >> >> >       8192 -> 16383      : 4374     |                                        |
> >> >> >      16384 -> 32767      : 2963     |                                        |
> >> >> >      32768 -> 65535      : 6407     |                                        |
> >> >> >      65536 -> 131071     : 884806   |****************************************|
> >> >> >     131072 -> 262143     : 145931   |******                                  |
> >> >> >     262144 -> 524287     : 13406    |                                        |
> >> >> >     524288 -> 1048575    : 1874     |                                        |
> >> >> >    1048576 -> 2097151    : 249      |                                        |
> >> >> >    2097152 -> 4194303    : 28       |                                        |
> >> >> >
> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >
> >> >> > - Conclusion
> >> >> >
> >> >> > This Intel CPU works fine with the default setting.
> >> >> >
> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> > ==============================================================
> >> >> >
> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> > node 0 only.
> >> >> >
> >> >> > - Default value of 5
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 46       |                                        |
> >> >> >        512 -> 1023       : 695      |                                        |
> >> >> >       1024 -> 2047       : 19950    |*                                       |
> >> >> >       2048 -> 4095       : 1788     |                                        |
> >> >> >       4096 -> 8191       : 3392     |                                        |
> >> >> >       8192 -> 16383      : 2569     |                                        |
> >> >> >      16384 -> 32767      : 2619     |                                        |
> >> >> >      32768 -> 65535      : 3809     |                                        |
> >> >> >      65536 -> 131071     : 616182   |****************************************|
> >> >> >     131072 -> 262143     : 295587   |*******************                     |
> >> >> >     262144 -> 524287     : 75357    |****                                    |
> >> >> >     524288 -> 1048575    : 15471    |*                                       |
> >> >> >    1048576 -> 2097151    : 2939     |                                        |
> >> >> >    2097152 -> 4194303    : 243      |                                        |
> >> >> >    4194304 -> 8388607    : 3        |                                        |
> >> >> >
> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >
> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> > latency exceeding 4ms.
> >> >> >
> >> >> > - Value set to 0
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 24       |                                        |
> >> >> >        512 -> 1023       : 2686     |                                        |
> >> >> >       1024 -> 2047       : 10246    |                                        |
> >> >> >       2048 -> 4095       : 4061529  |*********                               |
> >> >> >       4096 -> 8191       : 16894971 |****************************************|
> >> >> >       8192 -> 16383      : 6279310  |**************                          |
> >> >> >      16384 -> 32767      : 1658240  |***                                     |
> >> >> >      32768 -> 65535      : 445760   |*                                       |
> >> >> >      65536 -> 131071     : 110817   |                                        |
> >> >> >     131072 -> 262143     : 20279    |                                        |
> >> >> >     262144 -> 524287     : 4176     |                                        |
> >> >> >     524288 -> 1048575    : 436      |                                        |
> >> >> >    1048576 -> 2097151    : 8        |                                        |
> >> >> >    2097152 -> 4194303    : 2        |                                        |
> >> >> >
> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >
> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> > max latency is less than 4ms.
> >> >> >
> >> >> > - Conclusion
> >> >> >
> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> > applications work well with the default setting.
> >> >> >
> >> >> > It is worth noting that all the above data were collected on the upstream
> >> >> > kernel.
> >> >> >
> >> >> > Why introduce a sysctl knob?
> >> >> > ============================
> >> >> >
> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >
> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> > setting for better throughput.
> >> >> > In our production environment, we set this value to 0 for applications
> >> >> > running on Kubernetes servers while keeping it at the default value of 5
> >> >> > for other applications like big data. It's not common to release
> >> >> > individual kernel packages for each application.
> >> >>
> >> >> Thanks for the detailed performance data!
> >> >>
> >> >> Have you observed any downside to setting CONFIG_PCP_BATCH_SCALE_MAX to 0
> >> >> in your environment?  If not, I suggest using 0 as the default for
> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that a large
> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.  After
> >> >> that, if someone finds that some other workloads need a larger
> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it dynamically tunable.
> >> >>
> >> >
> >> > The decision doesn't rest with us, the kernel team at our company.
> >> > It's made by the system administrators who manage a large number of
> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> > servers, not in other environments like big data servers. We have
> >> > informed other system administrators, such as those managing the big
> >> > data servers, about the latency spike issues, but they are unwilling
> >> > to make the change.
> >> >
> >> > No one wants to make changes unless there is evidence showing that the
> >> > old settings will negatively impact them. However, as you know,
> >> > latency is not a critical concern for big data; throughput is more
> >> > important. If we keep the current settings, we will have to release
> >> > different kernel packages for different environments, which is a
> >> > significant burden for us.
> >>
> >> I totally understand your requirements, and I think this is better
> >> resolved in your downstream kernel.  If there is clear evidence that a
> >> small batch number hurts throughput for some workloads, we can
> >> make the change in the upstream kernel.
> >>
> >
> > Please don't make this more complicated. We are at an impasse.
> >
> > The key issue here is that the upstream kernel has a default value of
> > 5, not 0. If you can change it to 0, we can persuade our users to
> > follow the upstream changes. They currently set it to 5, not because
> > you, the author, chose this value, but because it is the default in
> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> > support it. It's not just your decision as the author, but the entire
> > community supports this default.
> >
> > If, in the future, we find that the value of 0 is not suitable, you'll
> > tell us, "It is an issue in your downstream kernel, not in the
> > upstream kernel, so we won't accept it." PANIC.
>
> I don't think so.  I suggest you change the default value to 0.  If
> someone reports that their workloads need some other value, then we have
> evidence that different workloads need different values.  At that time,
> we can suggest adding a user-tunable knob.
>

The problem is that others are unaware we've set it to 0, and I can't
constantly monitor the linux-mm mailing list. Additionally, it's
possible that you can't always keep an eye on it either.

I believe we should hear Andrew's suggestion. Andrew, what is your opinion?

-- 
Regards
Yafang
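
As a cross-check on the funclatency summaries quoted above, here is a minimal Python sketch that recomputes each reported average from its total and count. The (total, count) pairs are copied from the thread; the dictionary labels and the function name avg_nsecs are illustrative assumptions, not identifiers from the patch or from BCC.

```python
# (total_nsecs, count) pairs copied from the funclatency summary lines
# quoted in the thread. The labels are illustrative, not from the patch.
RESULTS = {
    "amd_default_5": (587066511229, 1305242),
    "amd_value_0": (708638369918, 36604636),
    "intel_2node_default_5": (106778157925, 1110263),
    "intel_1node_default_5": (150281196195, 1040651),
    "intel_1node_value_0": (247739809022, 29488508),
}


def avg_nsecs(total_nsecs: int, count: int) -> int:
    """Return the average latency in nanoseconds, truncated to an integer."""
    return total_nsecs // count


if __name__ == "__main__":
    # Print one line per configuration for comparison with the thread.
    for label, (total, count) in RESULTS.items():
        print(f"{label:24s} avg = {avg_nsecs(total, count)} nsecs")
```

Truncating integer division gives the same avg values as the summary lines in the thread (449775, 19359, 96173, 144410 and 8401 nsecs), so the quoted totals, counts and averages are mutually consistent.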