From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: akpm@linux-foundation.org, Matthew Wilcox, mgorman@techsingularity.net, linux-mm@kvack.org, David Rientjes
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
In-Reply-To: (Yafang Shao's message of "Fri, 12 Jul 2024 17:46:54 +0800")
References: <20240707094956.94654-1-laoar.shao@gmail.com> <87frsg9waa.fsf@yhuang6-desk2.ccr.corp.intel.com> <877cds9pa2.fsf@yhuang6-desk2.ccr.corp.intel.com> <87y1678l0f.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plrj8g42.fsf@yhuang6-desk2.ccr.corp.intel.com> <87h6cv89n4.fsf@yhuang6-desk2.ccr.corp.intel.com> <87cynj878z.fsf@yhuang6-desk2.ccr.corp.intel.com> <874j8v851a.fsf@yhuang6-desk2.ccr.corp.intel.com> <87zfqn6mr5.fsf@yhuang6-desk2.ccr.corp.intel.com> <87v81b6kmd.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 15 Jul 2024 09:09:41 +0800
Message-ID: <87r0bv7962.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao wrote:
>> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying wrote:
>> > Yafang Shao writes:
>> > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying wrote:
>> > >> Yafang Shao writes:
>> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying wrote:
>> > >> >> Yafang Shao writes:
>> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying wrote:
>> > >> >> >> Yafang Shao writes:
>> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying wrote:
>> > >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
>> > >> >> >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
>> > >> >> >> >> >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
>> > >> >> >> >> >> >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
>> > >> >> >> >> >> >> >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
>> > >> >> >> >> >> >> >> >> >> >> >> Yafang Shao writes:
>> > >> >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for quickly experimenting with specific workloads in a production environment, particularly when monitoring latency spikes caused by contention on the zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max is introduced as a more practical alternative.
>> > >> >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI too.
>> > >> >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions have been proposed. One approach involves dividing large zones into multi smaller zones, as suggested by Matthew[0], while another entails splitting the zone->lock using a mechanism similar to memory arenas and shifting away from relying solely on zone_id to identify the range of free lists a particular page belongs to[1]. However, implementing these solutions is likely to necessitate a more extended development effort.
>> > >> >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock contention. Instead, it will reduce page allocation/freeing latency.
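For a rough sense of the arithmetic behind this point (an illustration only; it assumes the freeing path drains at most pcp->batch << scale_max pages per zone->lock hold, as described for CONFIG_PCP_BATCH_SCALE_MAX in commit 52166607ecc9, and a typical pcp->batch of 63):

    scale_max = 5:  63 << 5 = 2016 pages freed per lock hold
    scale_max = 0:  63 << 0 =   63 pages freed per lock hold

Under those assumptions the total number of pages freed is unchanged, so a smaller scale_max takes the zone->lock roughly 32 times more often while holding it about 32 times shorter each time: the worst-case hold time (latency spike) shrinks, but the overall time spent under zone->lock (contention) is not reduced.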
>> > >> >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a configuration that has proven to be difficult to use, and you have been resistant to suggestions for modifying it to a more user-friendly and practical tuning approach. May I inquire about the rationale behind introducing this configuration in the beginning?
>> > >> >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is "neutral"?
>> > >> >> >> >> >> >> >> >> >> > No, thanks.
>> > >> >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive explanation of what "neutral" means, providing me with a better understanding of the concept.
>> > >> >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> > >> >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long latency"). Which introduces the config.
>> > >> >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize this config in real production workload?
>> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you explain why you need it? Why cannot you use a fixed value after initial experiments.
>> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds of thousands of servers, it begs the question: how do you propose we efficiently manage the various workloads that remain unaffected by the sysctl change implemented on just a few thousand servers? Is it feasible to expect us to recompile and release a new kernel for every instance where the default value falls short? Surely, there must be more practical and efficient approaches we can explore together to ensure optimal performance across all workloads.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are not solely confined to a test or lab environment. It's vital to also consider the needs and requirements of our actual users, along with the diverse workloads they encounter in their daily operations.
>> > >> >> >> >> >> >> >> Have you found that your different systems requires different CONFIG_PCP_BATCH_SCALE_MAX value already?
>> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> > >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the default value is also suboptimal. What is the issue with this approach?
>> > >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific. So, other workloads run on the same system will be impacted too. Will you run one workload only on one system?
>> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in your lab environment, while I'm struggling with real-world production issues.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > For servers:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Is this hard to understand?
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > In other words:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > For applications:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
>> > >> >> >> >> >> Good to know this. Thanks!
>> > >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For example, we need to use different configuration on different systems otherwise some workloads will be hurt. Can you provide some evidences to support your change? IMHO, it's not good enough to say I don't know why I just don't want to change existing systems. If so, it may be better to wait until you have more evidences.
>> > >> >> >> >> >> > It seems the community encourages developers to experiment with their improvements in lab environments using meticulously designed test cases A, B, C, and as many others as they can imagine, ultimately obtaining perfect data. However, it discourages developers from directly addressing real-world workloads. Sigh.
>> > >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different batch number and how in your production environment? If you cannot, how do you decide which workload deploys on which system (with different batch number configuration). If you can, can you provide such information to support your patch?
>> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly focusing on TcpExt indicators, to keep a close eye on application latency. This includes metrics such as TcpExt.TCPTimeouts, TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
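One way to watch counters like these for spikes (a minimal sketch, assuming the counters of interest are exposed in the TcpExt rows of /proc/net/netstat on the running kernel; nstat from iproute2 can report the same counters as per-interval deltas):

    #!/bin/sh
    # Print the TcpExt header and value rows once per second so that
    # sudden jumps in TCPTimeouts, DelayedACKLost, TCPOFOQueue, etc.
    # can be correlated with pcp batch tuning changes.
    while sleep 1; do
        date
        grep '^TcpExt:' /proc/net/netstat
    done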
>> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per second, which serves as a clear indication that other applications are experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max parameter to 0, we've been able to drastically reduce the maximum frequency of these timeouts to less than one per second.
>> > >> >> >> >> Thanks a lot for sharing this. I learned much from it!
>> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters that exclusively host the identified problematic applications, and we're closely monitoring their performance to ensure stability. To date, we've observed no network latency issues as a result of this change. However, we remain cautious about extending this optimization to other clusters, as the decision ultimately depends on a variety of factors.
>> > >> >> >> >> >
>> > >> >> >> >> > It's important to note that we're not eager to implement this change across our entire fleet, as we recognize the potential for unforeseen consequences. Instead, we're taking a cautious approach by initially applying it to a limited number of servers. This allows us to assess its impact and make informed decisions about whether or not to expand its use in the future.
>> > >> >> >> >> So, you haven't observed any performance hurt yet. Right?
>> > >> >> >> > Right.
>> > >> >> >> >> If you haven't, I suggest you to keep the patch in your downstream kernel for a while. In the future, if you find the performance of some workloads hurts because of the new batch number, you can repost the patch with the supporting data. If in the end, the performance of more and more workloads is good with the new batch number. You may consider to make 0 the default value :-)
>> > >> >> >> > That is not how the real world works.
>> > >> >> >> >
>> > >> >> >> > In the real world:
>> > >> >> >> >
>> > >> >> >> > - No one knows what may happen in the future.
>> > >> >> >> >   Therefore, if possible, we should make systems flexible, unless there is a strong justification for using a hard-coded value.
>> > >> >> >> >
>> > >> >> >> > - Minimize changes whenever possible.
>> > >> >> >> >   These systems have been working fine in the past, even if with lower performance. Why make changes just for the sake of improving performance? Does the key metric of your performance data truly matter for their workload?
>> > >> >> >> These are good policy in your organization and business. But, it's not necessary the policy that Linux kernel upstream should take.
>> > >> >> > You mean the Upstream Linux kernel only designed for the lab ?
>> > >> >> >> Community needs to consider long-term maintenance overhead, so it adds new ABI (such as sysfs knob) to kernel with the necessary justification. In general, it prefer to use a good default value or an automatic algorithm that works for everyone. Community tries avoiding (or fixing) regressions as much as possible, but this will not stop kernel from changing, even if it's big.
>> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>> > >> >> Linux kernel will not break ABI until the last users stop using it.
>> > >> > However, you haven't given a clear reference why the systl is an ABI.
>> > >> TBH, I don't find a formal document said it explicitly after some searching.
>> > >>
>> > >> Hi, Andrew, Matthew,
>> > >>
>> > >> Can you help me on this? Whether sysctl is considered Linux kernel ABI? Or something similar?
>> > > In my experience, we consistently utilize an if-statement to configure sysctl settings in our production environments.
>> > >
>> > > if [ -f ${sysctl_file} ]; then
>> > >     echo ${new_value} > ${sysctl_file}
>> > > fi
>> > >
>> > > Additionally, you can incorporate this into rc.local to ensure the configuration is applied upon system reboot.
>> > >
>> > > Even if you add it to the sysctl.conf without the if-statement, it won't break anything.
>> > >
>> > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction, underwent a naming change along with a functional update from its predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite this significant change, there have been no reported issues or complaints, suggesting that the renaming and functional update have not negatively impacted the system's functionality.
>> > Thanks for your information. From the commit, sysctl isn't considered as the kernel ABI.
>> >
>> > Even if so, IMHO, we shouldn't introduce a user tunable knob without a real world requirements except more flexibility.
>> Indeed, I do not reside in the physical realm but within a virtualized universe. (Of course, that is your perspective.)
> One final note: you explained very well what "neutral" means. Thank you for your comments.

Originally, my opinion to the change is neutral. But, after more thoughts, I changed my opinion to "we need more evidence to prove the knob is necessary".

--
Best Regards,
Huang, Ying