From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Mon, 29 Jul 2024 14:13:07 +0800
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes
In-Reply-To: <87v80owxd8.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240729023532.1555-1-laoar.shao@gmail.com> <20240729023532.1555-4-laoar.shao@gmail.com> <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com> <874j88ye65.fsf@yhuang6-desk2.ccr.corp.intel.com> <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com> <87v80owxd8.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"
On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at
11:22 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Hi, Yafang,
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> >> > in practice.
> >> >> >>
> >> >> >> As we discussed before [1], I still find the description of the zone->lock
> >> >> >> contention confusing. How about changing the description to
> >> >> >> something like,
> >> >> >
> >> >> > Sure, I will change it.
> >> >> >
> >> >> >>
> >> >> >> Larger page allocation/freeing batch numbers may cause longer runtime of
> >> >> >> the code holding zone->lock. If zone->lock is heavily contended at the same
> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> >> Although reducing the batch number cannot make zone->lock contention
> >> >> >> lighter, it can reduce the latency spikes effectively.
> >> >> >>
> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >> >> >>
> >> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >> >
> >> >> >> >     import mmap
> >> >> >> >
> >> >> >> >     size = 6 * 1024**3
> >> >> >> >
> >> >> >> >     while True:
> >> >> >> >         mm = mmap.mmap(-1, size)
> >> >> >> >         mm[:] = b'\xff' * size
> >> >> >> >         mm.close()
> >> >> >> >
> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tool
> >> >> >> > funclatency[1]:
> >> >> >> >
> >> >> >> >     funclatency -T -i 600 rmqueue_bulk
> >> >> >> >
> >> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >> >
> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> >> > ======================================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 12       |                                        |
> >> >> >> >       1024 -> 2047       : 9116     |                                        |
> >> >> >> >       2048 -> 4095       : 2004     |                                        |
> >> >> >> >       4096 -> 8191       : 2497     |                                        |
> >> >> >> >       8192 -> 16383      : 2127     |                                        |
> >> >> >> >      16384 -> 32767      : 2483     |                                        |
> >> >> >> >      32768 -> 65535      : 10102    |                                        |
> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
> >> >> >> >
> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >> >
> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> >> > than 30ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 92       |                                        |
> >> >> >> >       1024 -> 2047       : 8594     |                                        |
> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
> >> >> >> >     131072 -> 262143     : 204004   |                                        |
> >> >> >> >     262144 -> 524287     : 78246    |                                        |
> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
> >> >> >> >
> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >> >
> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> >> > to less than 8ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >> >
> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> >> > test it on different AMD models.
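The effect measured above can be sketched with a toy model: the number of pages moved under a single zone->lock hold grows roughly as batch << pcp_batch_scale_max, so each step of the scale factor doubles the worst-case hold time. The batch value and per-page cost below are illustrative assumptions, not kernel constants:

```python
# Toy model: pages handled per zone->lock hold vs. vm.pcp_batch_scale_max.
def max_pages_per_lock_hold(batch: int, scale_max: int) -> int:
    """Upper bound on pages allocated/freed while zone->lock is held."""
    return batch << scale_max

BATCH = 64          # assumed per-CPU batch size, for illustration only
NS_PER_PAGE = 100   # assumed per-page cost under the lock, for illustration

for scale_max in (0, 5):
    pages = max_pages_per_lock_hold(BATCH, scale_max)
    print(f"scale_max={scale_max}: up to {pages} pages, "
          f"~{pages * NS_PER_PAGE // 1000} us per lock hold")
```

Under these assumed numbers, a scale of 5 permits lock holds 32x longer than a scale of 0, which matches the overall shape of the latency distributions above.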
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> >> > ============================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 2419     |                                        |
> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
> >> >> >> >       2048 -> 4095       : 4272     |                                        |
> >> >> >> >       4096 -> 8191       : 9035     |                                        |
> >> >> >> >       8192 -> 16383      : 4374     |                                        |
> >> >> >> >      16384 -> 32767      : 2963     |                                        |
> >> >> >> >      32768 -> 65535      : 6407     |                                        |
> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
> >> >> >> >     262144 -> 524287     : 13406    |                                        |
> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
> >> >> >> >
> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > This Intel CPU works fine with the default setting.
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> >> > ==============================================================
> >> >> >> >
> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> >> > node 0 only.
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 46       |                                        |
> >> >> >> >        512 -> 1023       : 695      |                                        |
> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
> >> >> >> >       2048 -> 4095       : 1788     |                                        |
> >> >> >> >       4096 -> 8191       : 3392     |                                        |
> >> >> >> >       8192 -> 16383      : 2569     |                                        |
> >> >> >> >      16384 -> 32767      : 2619     |                                        |
> >> >> >> >      32768 -> 65535      : 3809     |                                        |
> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
> >> >> >> >
> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >> >
> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> >> > latency exceeding 4ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 24       |                                        |
> >> >> >> >        512 -> 1023       : 2686     |                                        |
> >> >> >> >       1024 -> 2047       : 10246    |                                        |
> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
> >> >> >> >      65536 -> 131071     : 110817   |                                        |
> >> >> >> >     131072 -> 262143     : 20279    |                                        |
> >> >> >> >     262144 -> 524287     : 4176     |                                        |
> >> >> >> >     524288 -> 1048575    : 436      |                                        |
> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
> >> >> >> >
> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >> >
> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> >> > max latency is less than 4ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> >> > applications work well with the default setting.
> >> >> >> >
> >> >> >> > It is worth noting that all the above data were tested using the upstream
> >> >> >> > kernel.
> >> >> >> >
> >> >> >> > Why introduce a sysctl knob?
> >> >> >> > ============================
> >> >> >> >
> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >> >
> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> >> > setting for better throughput. In our production environment, we set this
> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> >> >> > at the default value of 5 for other applications like big data. It's not
> >> >> >> > common to release individual kernel packages for each application.
> >> >> >>
> >> >> >> Thanks for the detailed performance data!
> >> >> >>
> >> >> >> Is there any downside observed from setting CONFIG_PCP_BATCH_SCALE_MAX to 0 in
> >> >> >> your environment? If not, I suggest using 0 as the default for
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
> >> >> >> that, if someone finds that some other workloads need a larger
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >> >> >>
> >> >> >
> >> >> > The decision doesn't rest with us, the kernel team at our company.
> >> >> > It's made by the system administrators who manage a large number of
> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> >> > servers, not in other environments like big data servers. We have
> >> >> > informed other system administrators, such as those managing the big
> >> >> > data servers, about the latency spike issues, but they are unwilling
> >> >> > to make the change.
> >> >> >
> >> >> > No one wants to make changes unless there is evidence showing that the
> >> >> > old settings will negatively impact them. However, as you know,
> >> >> > latency is not a critical concern for big data; throughput is more
> >> >> > important. If we keep the current settings, we will have to release
> >> >> > different kernel packages for different environments, which is a
> >> >> > significant burden for us.
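The per-fleet split described above (0 for Kubernetes hosts, 5 for big data) could be staged as a sysctl.d drop-in once the knob from this patch is available. The role table, file name, and staging path below are illustrative assumptions, not part of the patch:

```python
# Sketch: stage a sysctl.d fragment pinning vm.pcp_batch_scale_max per host
# role. ROLE_SETTINGS and the drop-in file name are hypothetical examples.
import tempfile
from pathlib import Path

ROLE_SETTINGS = {"k8s": 0, "bigdata": 5}  # latency- vs. throughput-oriented

def write_dropin(role: str, conf_dir: Path) -> Path:
    """Write a sysctl.d fragment for the given host role and return its path."""
    conf = conf_dir / "90-pcp-batch.conf"
    conf.write_text(f"vm.pcp_batch_scale_max = {ROLE_SETTINGS[role]}\n")
    return conf  # deploy to /etc/sysctl.d and run `sysctl -p` on the host

staging = Path(tempfile.mkdtemp())
print(write_dropin("k8s", staging).read_text().strip())
```

This keeps a single kernel package for all environments and moves the policy into per-host configuration, which is the point of making the value a runtime sysctl rather than a build-time config.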
> >> >>
> >> >> Totally understand your requirements. And I think that this is better
> >> >> resolved in your downstream kernel. If there is clear evidence
> >> >> that a small batch number hurts throughput for some workloads, we can
> >> >> make the change in the upstream kernel.
> >> >>
> >> >
> >> > Please don't make this more complicated. We are at an impasse.
> >> >
> >> > The key issue here is that the upstream kernel has a default value of
> >> > 5, not 0. If you can change it to 0, we can persuade our users to
> >> > follow the upstream changes. They currently set it to 5, not because
> >> > you, the author, chose this value, but because it is the default in
> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> >> > support it. It's not just your decision as the author; the entire
> >> > community supports this default.
> >> >
> >> > If, in the future, we find that the value of 0 is not suitable, you'll
> >> > tell us, "It is an issue in your downstream kernel, not in the
> >> > upstream kernel, so we won't accept it." PANIC.
> >>
> >> I don't think so. I suggest you change the default value to 0. If
> >> someone reports that their workloads need some other value, then we have
> >> evidence that different workloads need different values. At that time,
> >> we can suggest adding a user-tunable knob.
> >>
> >
> > The problem is that others are unaware we've set it to 0, and I can't
> > constantly monitor the linux-mm mailing list. Additionally, it's
> > possible that you can't always keep an eye on it either.
>
> IIUC, they will use the default value. Then, if there is any
> performance regression, they can report it.

Now we have reported it. What is your reply? "Keep it in your downstream
kernel." Wow, PANIC again.

> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>
> --
> Best Regards,
> Huang, Ying

--
Regards
Yafang