From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Fri, 12 Jul 2024 17:24:03 +0800
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying" <ying.huang@intel.com>
Cc: akpm@linux-foundation.org, Matthew Wilcox, mgorman@techsingularity.net, linux-mm@kvack.org, David Rientjes
In-Reply-To: <87v81b6kmd.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240707094956.94654-1-laoar.shao@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao writes:
>
> > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
> >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
> >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.
> >> >> >> >> >> >> >> >> >> >> I can understand that kernel
> >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI
> >> >> >> >> >> >> >> >> >> >> too.
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
> >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
> >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
> >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
> >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
> >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
> >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
> >> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
> >> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
> >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
> >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
> >> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
> >> >> >> >> >> >> >> >> >> "neutral"?
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
> >> >> >> >> >> >> >> >> > understanding of the concept.
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> >> >> >> >> >> >> >> >> latency").  Which introduces the config.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> >> >> >> > this config in real production workload?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
> >> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
> >> >> >> >> >> >> >> >> experiments.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Have you found that your different systems requires different
> >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> >> >> >> > approach?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
> >> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
> >> >> >> >> >> >> you run one workload only on one system?
> >> >> >> >> >> >
> >> >> >> >> >> > It seems we're living on different planets. You're happily working in
> >> >> >> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> >> >> >> > issues.
> >> >> >> >> >> >
> >> >> >> >> >> > For servers:
> >> >> >> >> >> >
> >> >> >> >> >> >   Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> >> >   Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> >> >   Server 1,000,001 and beyond: Happy with all values
> >> >> >> >> >> >
> >> >> >> >> >> > Is this hard to understand?
> >> >> >> >> >> >
> >> >> >> >> >> > In other words:
> >> >> >> >> >> >
> >> >> >> >> >> > For applications:
> >> >> >> >> >> >
> >> >> >> >> >> >   Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> >> >> >> >   Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> >> >> >> >   Application 1,000,001 and beyond: Happy with all values
> >> >> >> >> >>
> >> >> >> >> >> Good to know this.  Thanks!
> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
> >> >> >> >> >> >> example, we need to use different configuration on different systems
> >> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
> >> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
> >> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
> >> >> >> >> >> >> better to wait until you have more evidences.
> >> >> >> >> >> >
> >> >> >> >> >> > It seems the community encourages developers to experiment with their
> >> >> >> >> >> > improvements in lab environments using meticulously designed test
> >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> >> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> >> >> >> > directly addressing real-world workloads. Sigh.
> >> >> >> >> >>
> >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
> >> >> >> >> >> batch number and how in your production environment?  If you cannot, how
> >> >> >> >> >> do you decide which workload deploys on which system (with different
> >> >> >> >> >> batch number configuration).  If you can, can you provide such
> >> >> >> >> >> information to support your patch?
> >> >> >> >> >
> >> >> >> >> > We leverage a meticulous selection of network metrics, particularly
> >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >> >> >> >
> >> >> >> >> > In instances where a problematic container terminates, we've noticed a
> >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> >> >> >> > second, which serves as a clear indication that other applications are
> >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> >> >> >> > frequency of these timeouts to less than one per second.
> >> >> >> >>
> >> >> >> >> Thanks a lot for sharing this.  I learned much from it!
> >> >> >> >>
> >> >> >> >> > At present, we're selectively applying this adjustment to clusters
> >> >> >> >> > that exclusively host the identified problematic applications, and
> >> >> >> >> > we're closely monitoring their performance to ensure stability. To
> >> >> >> >> > date, we've observed no network latency issues as a result of this
> >> >> >> >> > change. However, we remain cautious about extending this optimization
> >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
> >> >> >> >> > factors.
> >> >> >> >> >
> >> >> >> >> > It's important to note that we're not eager to implement this change
> >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
> >> >> >> >> > applying it to a limited number of servers.
> >> >> >> >> > This allows us to assess
> >> >> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> >> >> > its use in the future.
> >> >> >> >>
> >> >> >> >> So, you haven't observed any performance hurt yet.  Right?
> >> >> >> >
> >> >> >> > Right.
> >> >> >> >
> >> >> >> >> If you
> >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> >> >> >> while.  In the future, if you find the performance of some workloads
> >> >> >> >> hurts because of the new batch number, you can repost the patch with the
> >> >> >> >> supporting data.  If in the end, the performance of more and more
> >> >> >> >> workloads is good with the new batch number.  You may consider to make 0
> >> >> >> >> the default value :-)
> >> >> >> >
> >> >> >> > That is not how the real world works.
> >> >> >> >
> >> >> >> > In the real world:
> >> >> >> >
> >> >> >> > - No one knows what may happen in the future.
> >> >> >> >   Therefore, if possible, we should make systems flexible, unless
> >> >> >> >   there is a strong justification for using a hard-coded value.
> >> >> >> >
> >> >> >> > - Minimize changes whenever possible.
> >> >> >> >   These systems have been working fine in the past, even if with lower
> >> >> >> >   performance. Why make changes just for the sake of improving
> >> >> >> >   performance? Does the key metric of your performance data truly matter
> >> >> >> >   for their workload?
> >> >> >>
> >> >> >> These are good policy in your organization and business.  But, it's not
> >> >> >> necessary the policy that Linux kernel upstream should take.
> >> >> >
> >> >> > You mean the Upstream Linux kernel only designed for the lab ?
> >> >> >
> >> >> >>
> >> >> >> Community needs to consider long-term maintenance overhead, so it adds
> >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> >> >> In general, it prefer to use a good default value or an automatic
> >> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
> >> >> >> regressions as much as possible, but this will not stop kernel from
> >> >> >> changing, even if it's big.
> >> >> >
> >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
> >> >>
> >> >> Linux kernel will not break ABI until the last users stop using it.
> >> >
> >> > However, you haven't given a clear reference why the systl is an ABI.
> >>
> >> TBH, I don't find a formal document said it explicitly after some
> >> searching.
> >>
> >> Hi, Andrew, Matthew,
> >>
> >> Can you help me on this?  Whether sysctl is considered Linux kernel ABI?
> >> Or something similar?
> >
> > In my experience, we consistently utilize an if-statement to configure
> > sysctl settings in our production environments.
> >
> >   if [ -f ${sysctl_file} ]; then
> >       echo ${new_value} > ${sysctl_file}
> >   fi
> >
> > Additionally, you can incorporate this into rc.local to ensure the
> > configuration is applied upon system reboot.
> >
> > Even if you add it to the sysctl.conf without the if-statement, it
> > won't break anything.
> >
> > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
> > underwent a naming change along with a functional update from its
> > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
> > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
> > this significant change, there have been no reported issues or
> > complaints, suggesting that the renaming and functional update have
> > not negatively impacted the system's functionality.
>
> Thanks for your information.  From the commit, sysctl isn't considered
> as the kernel ABI.
>
> Even if so, IMHO, we shouldn't introduce a user tunable knob without a
> real world requirements except more flexibility.
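To make the deployment pattern above concrete, here is a minimal sketch of
the boot-time script we have in mind (the values and paths are illustrative,
and /proc/sys/vm/pcp_batch_scale_max only exists once this series is
applied, which is exactly why the existence check matters):

```shell
#!/bin/sh
# Illustrative per-cluster rollout sketch (hypothetical values): apply the
# knob only when the running kernel actually provides it, so one image can
# be deployed across kernels with and without this series.

KNOB=/proc/sys/vm/pcp_batch_scale_max

set_pcp_batch_scale_max() {
    value=$1
    if [ -f "$KNOB" ]; then
        # Kernel has the sysctl: apply the cluster-specific value.
        echo "$value" > "$KNOB"
        echo "pcp_batch_scale_max set to $value"
    else
        # Older kernel (compile-time CONFIG_PCP_BATCH_SCALE_MAX only):
        # keep the built-in default rather than failing the boot script.
        echo "pcp_batch_scale_max not supported; keeping default"
    fi
}

# Example: latency-sensitive clusters would call
#   set_pcp_batch_scale_max 0
# while everyone else keeps the default of 5.
```

Hooked into rc.local (or an equivalent boot-time unit), this one call is the
entire per-cluster difference, and it is harmless on kernels that lack the
knob.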
Indeed, I do not reside in the physical realm but within a virtualized
universe. (Of course, that is your perspective.)

--
Regards
Yafang