From: Yafang Shao <laoar.shao@gmail.com>
Date: Mon, 29 Jul 2024 15:50:56 +0800
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes
In-Reply-To: <87r0bcwwpt.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240729023532.1555-1-laoar.shao@gmail.com> <20240729023532.1555-4-laoar.shao@gmail.com> <878qxkyjfr.fsf@yhuang6-desk2.ccr.corp.intel.com> <874j88ye65.fsf@yhuang6-desk2.ccr.corp.intel.com> <87zfq0wxub.fsf@yhuang6-desk2.ccr.corp.intel.com> <87v80owxd8.fsf@yhuang6-desk2.ccr.corp.intel.com> <87r0bcwwpt.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Jul 29, 2024 at 2:18 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Hi, Yafang,
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to
> >> >> >> >> > use in practice.
> >> >> >> >>
> >> >> >> >> As we discussed before [1], I still find the description of the
> >> >> >> >> zone->lock contention confusing.  How about changing the description to
> >> >> >> >> something like,
> >> >> >> >
> >> >> >> > Sure, I will change it.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Larger page allocation/freeing batch numbers may cause longer run time
> >> >> >> >> of code holding zone->lock.  If zone->lock is heavily contended at the
> >> >> >> >> same time, latency spikes may occur even for casual page
> >> >> >> >> allocation/freeing.  Although reducing the batch number cannot make the
> >> >> >> >> zone->lock contention lighter, it can reduce the latency spikes
> >> >> >> >> effectively.
> >> >> >> >>
> >> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >> >> >> >>
> >> >> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >> >> >
> >> >> >> >> >     import mmap
> >> >> >> >> >
> >> >> >> >> >     size = 6 * 1024**3
> >> >> >> >> >
> >> >> >> >> >     while True:
> >> >> >> >> >         mm = mmap.mmap(-1, size)
> >> >> >> >> >         mm[:] = b'\xff' * size
> >> >> >> >> >         mm.close()
> >> >> >> >> >
> >> >> >> >> > Run this script 10 times in parallel and measure the allocation
> >> >> >> >> > latency by measuring the duration of rmqueue_bulk() with the BCC tool
> >> >> >> >> > funclatency[1]:
> >> >> >> >> >
> >> >> >> >> >     funclatency -T -i 600 rmqueue_bulk
> >> >> >> >> >
> >> >> >> >> > Here are the results for both AMD and Intel CPUs.
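[Aside: the "10 copies in parallel" run described above can be driven with a
small harness like the sketch below. This is only an illustration based on the
quoted description; the multiprocessing driver is an assumption, not the exact
setup used to produce the numbers that follow.]

    import mmap
    from multiprocessing import Process

    SIZE = 6 * 1024**3   # 6 GiB per worker, as in the quoted script
    NPROC = 10           # "Run this script 10 times in parallel"

    def worker():
        # Repeatedly allocate, touch, and free a large anonymous mapping to
        # exercise the page allocator and the per-CPU page (PCP) lists.
        while True:
            mm = mmap.mmap(-1, SIZE)
            mm[:] = b'\xff' * SIZE
            mm.close()

    if __name__ == "__main__":
        # Run until interrupted; funclatency is attached to rmqueue_bulk()
        # in a separate shell while the workers are running.
        procs = [Process(target=worker) for _ in range(NPROC)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()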
> >> >> >> >> >
> >> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> >> >> > ======================================================================
> >> >> >> >> >
> >> >> >> >> > - Default value of 5
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >> >        512 -> 1023       : 12       |                                        |
> >> >> >> >> >       1024 -> 2047       : 9116     |                                        |
> >> >> >> >> >       2048 -> 4095       : 2004     |                                        |
> >> >> >> >> >       4096 -> 8191       : 2497     |                                        |
> >> >> >> >> >       8192 -> 16383      : 2127     |                                        |
> >> >> >> >> >      16384 -> 32767      : 2483     |                                        |
> >> >> >> >> >      32768 -> 65535      : 10102    |                                        |
> >> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
> >> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
> >> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
> >> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
> >> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
> >> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
> >> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
> >> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
> >> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
> >> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >> >> >
> >> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> >> >> > than 30ms.
> >> >> >> >> >
> >> >> >> >> > - Value set to 0
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >> >        512 -> 1023       : 92       |                                        |
> >> >> >> >> >       1024 -> 2047       : 8594     |                                        |
> >> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
> >> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
> >> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
> >> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
> >> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
> >> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
> >> >> >> >> >     131072 -> 262143     : 204004   |                                        |
> >> >> >> >> >     262144 -> 524287     : 78246    |                                        |
> >> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
> >> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
> >> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
> >> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >> >> >
> >> >> >> >> > The avg was reduced significantly to 19us, and the max latency is
> >> >> >> >> > reduced to less than 8ms.
> >> >> >> >> >
> >> >> >> >> > - Conclusion
> >> >> >> >> >
> >> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps
> >> >> >> >> > reduce latency. Latency-sensitive applications will benefit from this
> >> >> >> >> > tuning.
> >> >> >> >> >
> >> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable
> >> >> >> >> > to test it on different AMD models.
> >> >> >> >> >
> >> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> >> >> > ============================================================
> >> >> >> >> >
> >> >> >> >> > - Default value of 5
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >> >        512 -> 1023       : 2419     |                                        |
> >> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
> >> >> >> >> >       2048 -> 4095       : 4272     |                                        |
> >> >> >> >> >       4096 -> 8191       : 9035     |                                        |
> >> >> >> >> >       8192 -> 16383      : 4374     |                                        |
> >> >> >> >> >      16384 -> 32767      : 2963     |                                        |
> >> >> >> >> >      32768 -> 65535      : 6407     |                                        |
> >> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
> >> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
> >> >> >> >> >     262144 -> 524287     : 13406    |                                        |
> >> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
> >> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
> >> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >> >> >
> >> >> >> >> > - Conclusion
> >> >> >> >> >
> >> >> >> >> > This Intel CPU works fine with the default setting.
> >> >> >> >> >
> >> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> >> >> > ==============================================================
> >> >> >> >> >
> >> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> >> >> > node 0 only.
> >> >> >> >> >
> >> >> >> >> > - Default value of 5
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 46       |                                        |
> >> >> >> >> >        512 -> 1023       : 695      |                                        |
> >> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
> >> >> >> >> >       2048 -> 4095       : 1788     |                                        |
> >> >> >> >> >       4096 -> 8191       : 3392     |                                        |
> >> >> >> >> >       8192 -> 16383      : 2569     |                                        |
> >> >> >> >> >      16384 -> 32767      : 2619     |                                        |
> >> >> >> >> >      32768 -> 65535      : 3809     |                                        |
> >> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
> >> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
> >> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
> >> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
> >> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
> >> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
> >> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >> >> >
> >> >> >> >> > The zone->lock contention becomes severe when there is only a single
> >> >> >> >> > NUMA node.
> >> >> >> >> > The average latency is approximately 144us, with the maximum latency
> >> >> >> >> > exceeding 4ms.
> >> >> >> >> >
> >> >> >> >> > - Value set to 0
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 24       |                                        |
> >> >> >> >> >        512 -> 1023       : 2686     |                                        |
> >> >> >> >> >       1024 -> 2047       : 10246    |                                        |
> >> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
> >> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
> >> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
> >> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
> >> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
> >> >> >> >> >      65536 -> 131071     : 110817   |                                        |
> >> >> >> >> >     131072 -> 262143     : 20279    |                                        |
> >> >> >> >> >     262144 -> 524287     : 4176     |                                        |
> >> >> >> >> >     524288 -> 1048575    : 436      |                                        |
> >> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
> >> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >> >> >
> >> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> >> >> > max latency is less than 4ms.
> >> >> >> >> >
> >> >> >> >> > - Conclusion
> >> >> >> >> >
> >> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> >> >> > applications work well with the default setting.
> >> >> >> >> >
> >> >> >> >> > It is worth noting that all the above data were collected on the
> >> >> >> >> > upstream kernel.
> >> >> >> >> >
> >> >> >> >> > Why introduce a sysctl knob?
> >> >> >> >> > ============================
> >> >> >> >> >
> >> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >> >> >
> >> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the
> >> >> >> >> > default setting for better throughput. In our production environment, we
> >> >> >> >> > set this value to 0 for applications running on Kubernetes servers while
> >> >> >> >> > keeping it at the default value of 5 for other applications like big
> >> >> >> >> > data. It's not common to release individual kernel packages for each
> >> >> >> >> > application.
> >> >> >> >>
> >> >> >> >> Thanks for the detailed performance data!
> >> >> >> >>
> >> >> >> >> Have you observed any downside from setting CONFIG_PCP_BATCH_SCALE_MAX
> >> >> >> >> to 0 in your environment?  If not, I suggest using 0 as the default for
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.  After
> >> >> >> >> that, if someone finds that some other workloads need a larger
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >> >> >> >>
> >> >> >> >
> >> >> >> > The decision doesn't rest with us, the kernel team at our company.
> >> >> >> > It's made by the system administrators who manage a large number of
> >> >> >> > servers.
> >> >> >> > The latency spikes only occur on the Kubernetes (k8s) servers, not in
> >> >> >> > other environments like big data servers. We have informed other system
> >> >> >> > administrators, such as those managing the big data servers, about the
> >> >> >> > latency spike issues, but they are unwilling to make the change.
> >> >> >> >
> >> >> >> > No one wants to make changes unless there is evidence showing that the
> >> >> >> > old settings will negatively impact them. However, as you know, latency
> >> >> >> > is not a critical concern for big data; throughput is more important. If
> >> >> >> > we keep the current settings, we will have to release different kernel
> >> >> >> > packages for different environments, which is a significant burden for
> >> >> >> > us.
> >> >> >>
> >> >> >> Totally understand your requirements.  And, I think that this is better
> >> >> >> to be resolved in your downstream kernel.  If there is clear evidence
> >> >> >> that a small batch number hurts throughput for some workloads, we can
> >> >> >> make the change in the upstream kernel.
> >> >> >>
> >> >> >
> >> >> > Please don't make this more complicated. We are at an impasse.
> >> >> >
> >> >> > The key issue here is that the upstream kernel has a default value of
> >> >> > 5, not 0. If you can change it to 0, we can persuade our users to follow
> >> >> > the upstream changes. They currently set it to 5, not because you, the
> >> >> > author, chose this value, but because it is the default in Linus's tree.
> >> >> > Since it's in Linus's tree, kernel developers worldwide support it. It's
> >> >> > not just your decision as the author; the entire community supports this
> >> >> > default.
> >> >> >
> >> >> > If, in the future, we find that the value of 0 is not suitable, you'll
> >> >> > tell us, "It is an issue in your downstream kernel, not in the upstream
> >> >> > kernel, so we won't accept it." PANIC.
> >> >>
> >> >> I don't think so.  I suggest you change the default value to 0.  If
> >> >> someone reports that their workloads need some other value, then we have
> >> >> evidence that different workloads need different values.  At that time,
> >> >> we can suggest adding a user-tunable knob.
> >> >>
> >> >
> >> > The problem is that others are unaware we've set it to 0, and I can't
> >> > constantly monitor the linux-mm mailing list. Additionally, it's
> >> > possible that you can't always keep an eye on it either.
> >>
> >> IIUC, they will use the default value.  Then, if there is any
> >> performance regression, they can report it.
> >
> > Now we report it. What is your reply? "Keep it in your downstream
> > kernel." Wow, PANIC again.
>
> This is not all of my reply.  I suggested you change the default
> value too.

For the upstream kernel, I don't have a strong justification to change
the default value from 5 to 0. That's why I'm proposing to introduce a
sysctl.

For our downstream kernel, some system administrators want us to keep
this value the same as upstream because it works fine with the old
default value of 5. Therefore, we can't set the default value of our
downstream kernel to 0.

Let's wait for Andrew's suggestion.

--
Regards
Yafang
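[Aside: if the proposed knob is merged, the per-environment policy described in
the thread (0 on latency-sensitive Kubernetes nodes, the default 5 elsewhere)
could be applied with a small boot-time script along these lines. This is only
a sketch: it assumes the sysctl would be exposed as
/proc/sys/vm/pcp_batch_scale_max, and the role check is purely illustrative.]

    from pathlib import Path

    # Assumed procfs path for the proposed sysctl vm.pcp_batch_scale_max.
    KNOB = Path("/proc/sys/vm/pcp_batch_scale_max")

    def set_pcp_batch_scale_max(value: int) -> None:
        # Equivalent to: sysctl -w vm.pcp_batch_scale_max=<value>
        KNOB.write_text(f"{value}\n")

    def host_is_latency_sensitive() -> bool:
        # Placeholder role detection; replace with whatever identifies
        # Kubernetes nodes in a real fleet.
        return Path("/etc/kubernetes").exists()

    if __name__ == "__main__":
        if host_is_latency_sensitive():
            set_pcp_batch_scale_max(0)   # favour low allocation latency
        else:
            set_pcp_batch_scale_max(5)   # keep the default batch scaling
        print(KNOB.read_text().strip())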