From: Yafang Shao <laoar.shao@gmail.com>
Date: Fri, 12 Jul 2024 17:46:54 +0800
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, Matthew Wilcox, mgorman@techsingularity.net, linux-mm@kvack.org, David Rientjes
On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying wrote:
> > Yafang Shao writes:
> > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying wrote:
> > >> Yafang Shao writes:
> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying wrote:
> > >> >> Yafang Shao writes:
> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying wrote:
> > >> >> >> Yafang Shao writes:
> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying wrote:
> > >> >> >> >> Yafang Shao writes:
> > >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
> > >> >> >> >> >> Yafang Shao writes:
> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
> > >> >> >> >> >> >> Yafang Shao writes:
> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
> > >> >> >> >> >> >> >> Yafang Shao writes:
> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
> > >> >> >> >> >> >> >> >> Yafang Shao writes:
> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
> > >> >> >> >> >> >> >> >> >> Yafang Shao writes:
> > >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
> > >> >> >> >> >> >> >> >> >> >> Yafang Shao writes:
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for quickly experimenting with specific workloads in a production environment, particularly when monitoring latency spikes caused by contention on the zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max is introduced as a more practical alternative.
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI too.
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions have been proposed.
> > >> >> >> >> >> >> >> >> >> >> > One approach involves dividing large zones into multiple smaller zones, as suggested by Matthew[0], while another entails splitting the zone->lock using a mechanism similar to memory arenas and shifting away from relying solely on zone_id to identify the range of free lists a particular page belongs to[1]. However, implementing these solutions is likely to necessitate a more extended development effort.
> > >> >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock contention. Instead, it will reduce page allocation/freeing latency.
> > >> >> >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a configuration that has proven to be difficult to use, and you have been resistant to suggestions for modifying it to a more user-friendly and practical tuning approach. May I inquire about the rationale behind introducing this configuration in the beginning?
> > >> >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is "neutral"?
> > >> >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >> > No, thanks. After consulting with ChatGPT, I received a clear and comprehensive explanation of what "neutral" means, providing me with a better understanding of the concept.
> > >> >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning?
> > >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long latency"), which introduces the config.
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize this config in real production workloads?
> > >> >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you explain why you need it? Why can't you use a fixed value after initial experiments?
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds of thousands of servers, it begs the question: how do you propose we efficiently manage the various workloads that remain unaffected by the sysctl change implemented on just a few thousand servers? Is it feasible to expect us to recompile and release a new kernel for every instance where the default value falls short? Surely, there must be more practical and efficient approaches we can explore together to ensure optimal performance across all workloads.
> > >> >> >> >> >> >> >> >
> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are not solely confined to a test or lab environment. It's vital to also consider the needs and requirements of our actual users, along with the diverse workloads they encounter in their daily operations.
> > >> >> >> >> >> >> >>
> > >> >> >> >> >> >> >> Have you found that your different systems require different CONFIG_PCP_BATCH_SCALE_MAX values already?
> > >> >> >> >> >> >> >
> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0. For other workloads, we keep it unchanged until we determine that the default value is also suboptimal. What is the issue with this approach?
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Firstly, this is a system-wide configuration, not workload-specific. So, other workloads running on the same system will be impacted too. Will you run only one workload on one system?
> > >> >> >> >> >> >
> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in your lab environment, while I'm struggling with real-world production issues.
> > >> >> >> >> >> >
> > >> >> >> >> >> > For servers:
> > >> >> >> >> >> >
> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> > >> >> >> >> >> >
> > >> >> >> >> >> > Is this hard to understand?
> > >> >> >> >> >> >
> > >> >> >> >> >> > In other words, for applications:
> > >> >> >> >> >> >
> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> > >> >> >> >> >>
> > >> >> >> >> >> Good to know this. Thanks!
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Secondly, we need some evidence to introduce a new system ABI. For example, we need to use different configurations on different systems, otherwise some workloads will be hurt. Can you provide some evidence to support your change? IMHO, it's not good enough to say "I don't know why, I just don't want to change existing systems." If so, it may be better to wait until you have more evidence.
> > >> >> >> >> >> >
> > >> >> >> >> >> > It seems the community encourages developers to experiment with their improvements in lab environments using meticulously designed test cases A, B, C, and as many others as they can imagine, ultimately obtaining perfect data. However, it discourages developers from directly addressing real-world workloads. Sigh.
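(As an aside, a fleet rollout along the lines quoted above could be scripted roughly as follows. This is only a sketch: the hostname patterns and group boundaries are hypothetical, and only the knob path follows from the sysctl proposed in this patch.)

    #!/bin/sh
    # Sketch: choose a per-group value for the proposed knob and apply it only
    # if the running kernel actually exposes it. The group names are made up;
    # in practice they would come from an inventory or configuration system.
    knob=/proc/sys/vm/pcp_batch_scale_max

    case "$(hostname)" in
        latency-*) value=0 ;;   # latency-sensitive clusters: free pages in the smallest batches
        *)         value=5 ;;   # everywhere else: keep the current default
    esac

    if [ -f "$knob" ]; then
        echo "$value" > "$knob"
    fi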
> > >> >> >> >> >>
> > >> >> >> >> >> You cannot know whether your workloads benefit or are hurt by a different batch number, and how, in your production environment? If you cannot, how do you decide which workload deploys on which system (with a different batch number configuration)? If you can, can you provide such information to support your patch?
> > >> >> >> >> >
> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly focusing on TcpExt indicators, to keep a close eye on application latency. This includes metrics such as TcpExt.TCPTimeouts, TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> > >> >> >> >> >
> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per second, which serves as a clear indication that other applications are experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max parameter to 0, we've been able to drastically reduce the maximum frequency of these timeouts to less than one per second.
> > >> >> >> >>
> > >> >> >> >> Thanks a lot for sharing this. I learned much from it!
> > >> >> >> >>
> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters that exclusively host the identified problematic applications, and we're closely monitoring their performance to ensure stability. To date, we've observed no network latency issues as a result of this change. However, we remain cautious about extending this optimization to other clusters, as the decision ultimately depends on a variety of factors.
> > >> >> >> >> >
> > >> >> >> >> > It's important to note that we're not eager to implement this change across our entire fleet, as we recognize the potential for unforeseen consequences. Instead, we're taking a cautious approach by initially applying it to a limited number of servers. This allows us to assess its impact and make informed decisions about whether or not to expand its use in the future.
> > >> >> >> >>
> > >> >> >> >> So, you haven't observed any performance hurt yet. Right?
> > >> >> >> >
> > >> >> >> > Right.
> > >> >> >> >
> > >> >> >> >> If you haven't, I suggest you keep the patch in your downstream kernel for a while. In the future, if you find the performance of some workloads hurts because of the new batch number, you can repost the patch with the supporting data. If, in the end, the performance of more and more workloads is good with the new batch number, you may consider making 0 the default value :-)
> > >> >> >> >
> > >> >> >> > That is not how the real world works.
> > >> >> >> >
> > >> >> >> > In the real world:
> > >> >> >> >
> > >> >> >> > - No one knows what may happen in the future.
> > >> >> >> >   Therefore, if possible, we should make systems flexible, unless there is a strong justification for using a hard-coded value.
> > >> >> >> >
> > >> >> >> > - Minimize changes whenever possible.
> > >> >> >> >   These systems have been working fine in the past, even if with lower performance. Why make changes just for the sake of improving performance? Does the key metric of your performance data truly matter for their workload?
> > >> >> >>
> > >> >> >> These are good policies in your organization and business. But they are not necessarily the policies that Linux kernel upstream should take.
> > >> >> >
> > >> >> > You mean the upstream Linux kernel is only designed for the lab?
> > >> >> >
> > >> >> >> The community needs to consider long-term maintenance overhead, so it adds new ABI (such as a sysfs knob) to the kernel with the necessary justification. In general, it prefers to use a good default value or an automatic algorithm that works for everyone. The community tries to avoid (or fix) regressions as much as possible, but this will not stop the kernel from changing, even if the change is big.
> > >> >> >
> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> > >> >>
> > >> >> The Linux kernel will not break ABI until the last user stops using it.
> > >> >
> > >> > However, you haven't given a clear reference for why the sysctl is an ABI.
> > >>
> > >> TBH, I didn't find a formal document that says it explicitly after some searching.
> > >>
> > >> Hi, Andrew, Matthew,
> > >>
> > >> Can you help me on this? Is sysctl considered Linux kernel ABI? Or something similar?
> > >
> > > In my experience, we consistently utilize an if-statement to configure sysctl settings in our production environments.
> > >
> > > if [ -f ${sysctl_file} ]; then
> > >     echo ${new_value} > ${sysctl_file}
> > > fi
> > >
> > > Additionally, you can incorporate this into rc.local to ensure the configuration is applied upon system reboot.
> > >
> > > Even if you add it to sysctl.conf without the if-statement, it won't break anything.
> > >
> > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction, underwent a naming change along with a functional update from its predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite this significant change, there have been no reported issues or complaints, suggesting that the renaming and functional update have not negatively impacted the system's functionality.
> >
> > Thanks for your information. From the commit, sysctl isn't considered as the kernel ABI.
> >
> > Even if so, IMHO, we shouldn't introduce a user-tunable knob without a real-world requirement beyond more flexibility.
>
> Indeed, I do not reside in the physical realm but within a virtualized universe. (Of course, that is your perspective.)

One final note: you explained very well what "neutral" means.

Thank you for your comments.

-- 
Regards
Yafang
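(For reference, the TcpExt counters discussed in this thread are exposed in /proc/net/netstat. A minimal sketch of the kind of sampling described above, watching the per-second increase of TcpExt.TCPTimeouts, might look like the following; the counter name and the roughly one-per-second threshold come from the thread, everything else is illustrative.)

    #!/bin/sh
    # Sketch: print the per-second delta of TcpExt.TCPTimeouts by matching the
    # header and value lines of the TcpExt section in /proc/net/netstat.
    prev=""
    while true; do
        cur=$(awk '/^TcpExt:/ { if (hdr == "") hdr = $0; else val = $0 }
                   END {
                       n = split(hdr, h); split(val, v)
                       for (i = 1; i <= n; i++)
                           if (h[i] == "TCPTimeouts") print v[i]
                   }' /proc/net/netstat)
        [ -n "$prev" ] && echo "TCPTimeouts/s: $((cur - prev))"
        prev=$cur
        sleep 1
    done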