From: Yafang Shao
Date: Fri, 12 Jul 2024 16:49:44 +0800
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, Matthew Wilcox, mgorman@techsingularity.net, linux-mm@kvack.org, David Rientjes
In-Reply-To: <87zfqn6mr5.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel
> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI
> >> >> >> >> >> >> >> >> >> too.
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is
> >> >> >> >> >> >> >> >> "neutral"?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> >> >> latency"). Which introduces the config.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> >> >> > this config in real production workload?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you
> >> >> >> >> >> >> >> explain why you need it? Why cannot you use a fixed value after initial
> >> >> >> >> >> >> >> experiments.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Have you found that your different systems requires different
> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >> >> >> >
> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> >> >> > approach?
> >> >> >> >> >>
> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> >> >> >> So, other workloads run on the same system will be impacted too. Will
> >> >> >> >> >> you run one workload only on one system?
> >> >> >> >> >
> >> >> >> >> > It seems we're living on different planets. You're happily working in
> >> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> >> >> > issues.
> >> >> >> >> >
> >> >> >> >> > For servers:
> >> >> >> >> >
> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
> >> >> >> >> >
> >> >> >> >> > Is this hard to understand?
> >> >> >> >> >
> >> >> >> >> > In other words:
> >> >> >> >> >
> >> >> >> >> > For applications:
> >> >> >> >> >
> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
> >> >> >> >>
> >> >> >> >> Good to know this. Thanks!
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For
> >> >> >> >> >> example, we need to use different configuration on different systems
> >> >> >> >> >> otherwise some workloads will be hurt. Can you provide some evidences
> >> >> >> >> >> to support your change? IMHO, it's not good enough to say I don't know
> >> >> >> >> >> why I just don't want to change existing systems. If so, it may be
> >> >> >> >> >> better to wait until you have more evidences.
> >> >> >> >> >
> >> >> >> >> > It seems the community encourages developers to experiment with their
> >> >> >> >> > improvements in lab environments using meticulously designed test
> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> >> >> > directly addressing real-world workloads. Sigh.
> >> >> >> >>
> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> >> >> >> batch number and how in your production environment? If you cannot, how
> >> >> >> >> do you decide which workload deploys on which system (with different
> >> >> >> >> batch number configuration). If you can, can you provide such
> >> >> >> >> information to support your patch?
> >> >> >> >
> >> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >> >> >
> >> >> >> > In instances where a problematic container terminates, we've noticed a
> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> >> >> > second, which serves as a clear indication that other applications are
> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> >> >> > frequency of these timeouts to less than one per second.
> >> >> >>
> >> >> >> Thanks a lot for sharing this. I learned much from it!
> >> >> >>
> >> >> >> > At present, we're selectively applying this adjustment to clusters
> >> >> >> > that exclusively host the identified problematic applications, and
> >> >> >> > we're closely monitoring their performance to ensure stability. To
> >> >> >> > date, we've observed no network latency issues as a result of this
> >> >> >> > change. However, we remain cautious about extending this optimization
> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> >> >> > factors.
> >> >> >> >
> >> >> >> > It's important to note that we're not eager to implement this change
> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> >> >> > applying it to a limited number of servers. This allows us to assess
> >> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> >> > its use in the future.
> >> >> >>
> >> >> >> So, you haven't observed any performance hurt yet. Right?
> >> >> >
> >> >> > Right.
> >> >> >
> >> >> >> If you
> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> >> >> while. In the future, if you find the performance of some workloads
> >> >> >> hurts because of the new batch number, you can repost the patch with the
> >> >> >> supporting data. If in the end, the performance of more and more
> >> >> >> workloads is good with the new batch number. You may consider to make 0
> >> >> >> the default value :-)
> >> >> >
> >> >> > That is not how the real world works.
> >> >> >
> >> >> > In the real world:
> >> >> >
> >> >> > - No one knows what may happen in the future.
> >> >> >   Therefore, if possible, we should make systems flexible, unless
> >> >> >   there is a strong justification for using a hard-coded value.
> >> >> >
> >> >> > - Minimize changes whenever possible.
> >> >> >   These systems have been working fine in the past, even if with lower
> >> >> >   performance. Why make changes just for the sake of improving
> >> >> >   performance? Does the key metric of your performance data truly matter
> >> >> >   for their workload?
> >> >>
> >> >> These are good policy in your organization and business. But, it's not
> >> >> necessary the policy that Linux kernel upstream should take.
> >> >
> >> > You mean the Upstream Linux kernel only designed for the lab ?
> >> >
> >> >>
> >> >> Community needs to consider long-term maintenance overhead, so it adds
> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> >> In general, it prefer to use a good default value or an automatic
> >> >> algorithm that works for everyone. Community tries avoiding (or fixing)
> >> >> regressions as much as possible, but this will not stop kernel from
> >> >> changing, even if it's big.
> >> >
> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> >>
> >> Linux kernel will not break ABI until the last users stop using it.
> >
> > However, you haven't given a clear reference why the systl is an ABI.
>
> TBH, I don't find a formal document said it explicitly after some
> searching.
>
> Hi, Andrew, Matthew,
>
> Can you help me on this? Whether sysctl is considered Linux kernel ABI?
> Or something similar?

In our production environments, we always guard sysctl writes with an
if-statement, so a knob that does not exist on a given kernel is simply
skipped:

    if [ -f "${sysctl_file}" ]; then
            echo "${new_value}" > "${sysctl_file}"
    fi

You can also put this in rc.local so that the setting is reapplied after
a reboot. Even if you add the knob to sysctl.conf without the
if-statement, it won't break anything.
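
The guard above can be wrapped in a small helper. This is only a minimal sketch: the function name set_sysctl_if_present is hypothetical, and the /proc/sys/vm/pcp_batch_scale_max path in the usage comment assumes a kernel carrying this patch.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): write a value to a
# sysctl file only if the running kernel exposes that file. A missing knob
# is reported on stderr and treated as success, so boot scripts keep going
# on kernels that do not have it.
set_sysctl_if_present() {
    sysctl_file=$1
    new_value=$2
    if [ -f "$sysctl_file" ]; then
        echo "$new_value" > "$sysctl_file"
    else
        echo "skip: $sysctl_file not present" >&2
    fi
}

# Usage (needs root; the knob exists only with this patch applied):
# set_sysctl_if_present /proc/sys/vm/pcp_batch_scale_max 0
```

Called from rc.local, the same line works unchanged on kernels with and without the knob, which is the property the if-statement is meant to provide.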

The pcp-related sysctl vm.percpu_pagelist_high_fraction replaced its
predecessor, vm.percpu_pagelist_fraction, changing both the name and the
behavior, in commit 74f44822097c ("mm/page_alloc: introduce
vm.percpu_pagelist_high_fraction"). Despite this significant change,
there have been no reported issues or complaints, which suggests that
the rename and functional update did not break existing users.

-- 
Regards
Yafang