From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Fri, 12 Jul 2024 11:44:16 +0800
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob
 vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
 linux-mm@kvack.org, Matthew Wilcox, David Rientjes
In-Reply-To: <87plrj8g42.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240707094956.94654-1-laoar.shao@gmail.com>
 <20240707094956.94654-4-laoar.shao@gmail.com>
 <878qyaarm6.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o774a0pv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87frsg9waa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <877cds9pa2.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1678l0f.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrj8g42.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"
On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >>
> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >>
> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel
> >> >> >> >> >> configuration isn't as flexible as a sysctl knob. But a sysctl knob
> >> >> >> >> >> is ABI too.
> >> >> >> >> >>
> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multiple
> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >>
> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >
> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> > behind introducing this configuration in the first place?
> >> >> >> >>
> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what
> >> >> >> >> "neutral" means?
> >> >> >> >
> >> >> >> > No, thanks.
> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> > explanation of what "neutral" means, giving me a better
> >> >> >> > understanding of the concept.
> >> >> >> >
> >> >> >> > So, can you explain why you introduced it as a config in the first place?
> >> >> >>
> >> >> >> I think I explained that in the commit log of commit 52166607ecc9
> >> >> >> ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> latency"), which introduced the config.
> >> >> >
> >> >> > What specifically are your expectations for how users should utilize
> >> >> > this config in real production workloads?
> >> >> >
> >> >> >>
> >> >> >> A sysctl knob is ABI, which needs to be maintained forever. Can you
> >> >> >> explain why you need it? Why can't you use a fixed value after the
> >> >> >> initial experiments?
> >> >> >
> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> > of thousands of servers, how do you propose we efficiently manage the
> >> >> > various workloads that remain unaffected by the sysctl change
> >> >> > implemented on just a few thousand servers? Is it feasible to expect
> >> >> > us to recompile and release a new kernel for every instance where the
> >> >> > default value falls short? Surely there must be more practical and
> >> >> > efficient approaches we can explore together to ensure optimal
> >> >> > performance across all workloads.
> >> >> >
> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> > consider the needs and requirements of our actual users, along with
> >> >> > the diverse workloads they encounter in their daily operations.
> >> >>
> >> >> Have you found that your different systems require different
> >> >> CONFIG_PCP_BATCH_SCALE_MAX values already?
> >> >
> >> > For specific workloads that introduce latency, we set the value to 0.
> >> > For other workloads, we keep it unchanged until we determine that the
> >> > default value is also suboptimal. What is the issue with this
> >> > approach?
> >>
> >> Firstly, this is a system-wide configuration, not a workload-specific
> >> one, so other workloads running on the same system will be impacted
> >> too. Will you run only one workload on one system?
> >
> > It seems we're living on different planets. You're happily working in
> > your lab environment, while I'm struggling with real-world production
> > issues.
> >
> > For servers:
> >
> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > Server 1,000,001 and beyond: happy with all values
> >
> > Is this hard to understand?
> >
> > In other words:
> >
> > For applications:
> >
> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> > Application 1,000,001 and beyond: happy with all values
>
> Good to know this. Thanks!
>
> >>
> >> Secondly, we need some evidence to introduce a new system ABI. For
> >> example, we need to use different configurations on different systems,
> >> otherwise some workloads will be hurt. Can you provide some evidence
> >> to support your change? IMHO, it's not good enough to say "I don't
> >> know why, I just don't want to change existing systems." If so, it may
> >> be better to wait until you have more evidence.
> >
> > It seems the community encourages developers to experiment with their
> > improvements in lab environments using meticulously designed test
> > cases A, B, C, and as many others as they can imagine, ultimately
> > obtaining perfect data. However, it discourages developers from
> > directly addressing real-world workloads. Sigh.
>
> Can you not tell whether, and how, your workloads benefit from or are
> hurt by the different batch numbers in your production environment? If
> you cannot, how do you decide which workload is deployed on which
> system (with its different batch number configuration)? If you can,
> can you provide such information to support your patch?

We rely on a carefully selected set of network metrics, particularly
the TcpExt indicators, to keep a close eye on application latency.
These include TcpExt.TCPTimeouts, TcpExt.RetransSegs,
TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.

When a problematic container exits, we observe a sharp spike in
TcpExt.TCPTimeouts, reaching over 40 occurrences per second, which is a
clear indication that other applications on the host are experiencing
latency issues. After tuning vm.pcp_batch_scale_max to 0, the maximum
frequency of these timeouts drops to less than one per second.

At present, we're selectively applying this adjustment to clusters[0]
that exclusively host the identified problematic applications, and
we're closely monitoring their performance to ensure stability. To
date, we've observed no network latency issues as a result of this
change. However, we remain cautious about extending it to other
clusters, as that decision depends on a variety of factors.

We're not eager to roll this change out across our entire fleet, since
we recognize the potential for unforeseen consequences. Instead, we're
taking a cautious approach: applying it first to a limited number of
servers, assessing its impact, and then making an informed decision
about whether to expand its use.

[0] 'Cluster' refers to the Kubernetes concept: a single cluster
comprises a specific group of servers designed to work in unison.
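For concreteness, below is a minimal sketch of the correlation we
watch, assuming the vm.pcp_batch_scale_max knob from this patch is
applied (exposed as /proc/sys/vm/pcp_batch_scale_max). It samples
TcpExt.TCPTimeouts from /proc/net/netstat once per second and lowers
the knob when the timeout rate crosses the ~40/s spike described
above. This is illustrative only, not our production tooling; the
threshold, the daemon loop, and the tcpext_counter() helper are all
assumptions made for the example.

/*
 * Illustrative sketch, not production tooling: watch the per-second
 * delta of TcpExt.TCPTimeouts and, on a spike, write 0 to
 * vm.pcp_batch_scale_max (assumes the knob from this patch exists).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the value of a TcpExt field from /proc/net/netstat, or -1.
 * The file carries pairs of lines: a "TcpExt: <names>" header line
 * followed by a "TcpExt: <values>" counter line. */
static long tcpext_counter(const char *field)
{
	FILE *f = fopen("/proc/net/netstat", "r");
	char hdr[4096], val[4096];
	long result = -1;

	if (!f)
		return -1;
	while (fgets(hdr, sizeof(hdr), f) && fgets(val, sizeof(val), f)) {
		if (strncmp(hdr, "TcpExt:", 7))
			continue;
		/* Locate the field's position on the header line. */
		int idx = 0;
		char *h = strtok(hdr + 7, " \n");
		while (h && strcmp(h, field)) {
			h = strtok(NULL, " \n");
			idx++;
		}
		if (!h)
			break;
		/* Fetch the token at the same position on the value line. */
		char *v = strtok(val + 7, " \n");
		while (v && idx--)
			v = strtok(NULL, " \n");
		if (v)
			result = strtol(v, NULL, 10);
		break;
	}
	fclose(f);
	return result;
}

int main(void)
{
	long prev = tcpext_counter("TCPTimeouts");

	for (;;) {
		sleep(1);
		long cur = tcpext_counter("TCPTimeouts");
		/* A spike like the one described above: >40 timeouts/s. */
		if (prev >= 0 && cur - prev > 40) {
			FILE *k = fopen("/proc/sys/vm/pcp_batch_scale_max", "w");
			if (k) {
				/* 0 = free PCP pages in the smallest batches */
				fprintf(k, "0\n");
				fclose(k);
			}
		}
		prev = cur;
	}
}

Setting the knob by hand is simply 'sysctl -w
vm.pcp_batch_scale_max=0'; the loop above just automates that check
against the counters we already monitor.

--
Regards
Yafang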