From: Yafang Shao <laoar.shao@gmail.com>
Date: Fri, 12 Jul 2024 13:41:26 +0800
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes

On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production
> >> >> >> >> >> >> > environment, particularly when monitoring latency spikes caused by
> >> >> >> >> >> >> > contention on the zone->lock. To address this, a new sysctl parameter
> >> >> >> >> >> >> > vm.pcp_batch_scale_max is introduced as a more practical alternative.
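(To make the proposal concrete for anyone joining the thread late: the
change boils down to replacing the compile-time constant with a
sysctl-backed integer, roughly as in the sketch below. The identifiers
and wiring here are assumptions for illustration, not the actual diff,
and the registration API is the one used by recent kernels.)

	/* Sketch only: names assumed, not copied from the patch. */
	#include <linux/init.h>
	#include <linux/sysctl.h>

	static int pcp_batch_scale_max = 5;	/* CONFIG_PCP_BATCH_SCALE_MAX default */
	static int pcp_batch_scale_limit = 6;	/* the Kconfig range is 0..6 */

	static struct ctl_table pcp_batch_table[] = {
		{
			.procname	= "pcp_batch_scale_max",
			.data		= &pcp_batch_scale_max,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= SYSCTL_ZERO,
			.extra2		= &pcp_batch_scale_limit,
		},
	};

	static int __init pcp_batch_sysctl_init(void)
	{
		register_sysctl_init("vm", pcp_batch_table);
		return 0;
	}
	late_initcall(pcp_batch_sysctl_init);

The free/drain paths would then read READ_ONCE(pcp_batch_scale_max)
where they currently use CONFIG_PCP_BATCH_SCALE_MAX.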
> >> >> >> >> >> >>
> >> >> >> >> >> >> In general, I'm neutral on the change. I can understand that a kernel
> >> >> >> >> >> >> config option isn't as flexible as a sysctl knob. But a sysctl knob is
> >> >> >> >> >> >> ABI too.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several
> >> >> >> >> >> >> > suggestions have been proposed. One approach involves dividing large
> >> >> >> >> >> >> > zones into multiple smaller zones, as suggested by Matthew[0], while
> >> >> >> >> >> >> > another entails splitting the zone->lock using a mechanism similar to
> >> >> >> >> >> >> > memory arenas and shifting away from relying solely on zone_id to
> >> >> >> >> >> >> > identify the range of free lists a particular page belongs to[1].
> >> >> >> >> >> >> > However, implementing these solutions is likely to necessitate a more
> >> >> >> >> >> >> > extended development effort.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Per my understanding, the change will hurt rather than improve
> >> >> >> >> >> >> zone->lock contention. Instead, it will reduce page allocation/freeing
> >> >> >> >> >> >> latency.
> >> >> >> >> >> >
> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> > configuration that has proven difficult to use, and you have been
> >> >> >> >> >> > resistant to suggestions for turning it into a more user-friendly and
> >> >> >> >> >> > practical tuning approach. May I inquire about the rationale behind
> >> >> >> >> >> > introducing this configuration in the first place?
> >> >> >> >> >>
> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what
> >> >> >> >> >> "neutral" means?
> >> >> >> >> >
> >> >> >> >> > No, thanks.
> >> >> >> >> > After consulting ChatGPT, I received a clear and comprehensive
> >> >> >> >> > explanation of what "neutral" means, giving me a better understanding
> >> >> >> >> > of the concept.
> >> >> >> >> >
> >> >> >> >> > So, can you explain why you introduced it as a config in the first place?
> >> >> >> >>
> >> >> >> >> I think I explained that in the log of commit 52166607ecc9 ("mm:
> >> >> >> >> restrict the pcp batch scale factor to avoid too long latency"), which
> >> >> >> >> introduced the config.
> >> >> >> >
> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> > this config in a real production workload?
> >> >> >> >
> >> >> >> >>
> >> >> >> >> A sysctl knob is ABI, which needs to be maintained forever. Can you
> >> >> >> >> explain why you need it? Why can't you use a fixed value after the
> >> >> >> >> initial experiments?
> >> >> >> >
> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> > efficiently manage the various workloads that remain unaffected by a
> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> > instance where the default value falls short? Surely there must be
> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >
> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> > consider the needs and requirements of actual users, along with the
> >> >> >> > diverse workloads they encounter in their daily operations.
> >> >> >>
> >> >> >> Have you found that your different systems already require different
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX values?
> >> >> >
> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> > default value is also suboptimal. What is the issue with this approach?
> >> >>
> >> >> Firstly, this is a system-wide configuration, not a workload-specific
> >> >> one, so other workloads running on the same system will be impacted
> >> >> too. Will you run only one workload on each system?
> >> >
> >> > It seems we're living on different planets. You're happily working in
> >> > your lab environment, while I'm struggling with real-world production
> >> > issues.
> >> >
> >> > For servers:
> >> >
> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> > Server 1,000,001 and beyond: happy with any value
> >> >
> >> > Is this hard to understand?
> >> >
> >> > In other words, for applications:
> >> >
> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> > Application 1,000,001 and beyond: happy with any value
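This split is the whole point of a runtime knob: moving a server from
one class to another is a single write to procfs instead of a kernel
rebuild and a reboot. A hypothetical helper showing what the per-class
rollout amounts to (it assumes a kernel carrying this patch, so the
proc file exists; it is equivalent to running
"sysctl -w vm.pcp_batch_scale_max=N"):

	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		const char *path = "/proc/sys/vm/pcp_batch_scale_max";
		/* Default to 0, the value we use for the latency-sensitive class. */
		int val = argc > 1 ? atoi(argv[1]) : 0;
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);	/* stock kernel: the knob does not exist */
			return 1;
		}
		fprintf(f, "%d\n", val);
		return fclose(f) == 0 ? 0 : 1;
	}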
> >>
> >> Good to know this. Thanks!
> >>
> >> >>
> >> >> Secondly, we need some evidence to introduce a new system ABI: for
> >> >> example, evidence that we need to use different configurations on
> >> >> different systems because otherwise some workloads will be hurt. Can
> >> >> you provide some evidence to support your change? IMHO, it's not good
> >> >> enough to say "I don't know why, I just don't want to change existing
> >> >> systems". If so, it may be better to wait until you have more
> >> >> evidence.
> >
> > It seems the community encourages developers to experiment with their
> > improvements in lab environments using meticulously designed test
> > cases A, B, C, and as many others as they can imagine, ultimately
> > obtaining perfect data. However, it discourages developers from
> > directly addressing real-world workloads. Sigh.
> >>
> >> You cannot tell whether, and how, your workloads benefit or suffer
> >> from a different batch number in your production environment? If you
> >> cannot, how do you decide which workload is deployed on which system
> >> (with its different batch number configuration)? If you can, can you
> >> provide that information to support your patch?
> >
> > We leverage a meticulous selection of network metrics, particularly
> > TcpExt indicators, to keep a close eye on application latency. These
> > include TcpExt.TCPTimeouts, TcpExt.RetransSegs, TcpExt.DelayedACKLost,
> > TcpExt.TCPSlowStartRetrans, TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue,
> > and more.
> >
> > In instances where a problematic container terminates, we've noticed a
> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> > second, which serves as a clear indication that other applications are
> > experiencing latency issues. By tuning the vm.pcp_batch_scale_max
> > parameter to 0, we've been able to drastically reduce the maximum
> > frequency of these timeouts to less than one per second.
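For what it's worth, the sampling behind those numbers is simple. A
trimmed-down sketch of such a watcher (illustrative only, not our
production tooling, which collects the full TcpExt set) could look
like this:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	/* Return the TcpExt.TCPTimeouts counter, or -1 on failure. */
	static long tcp_timeouts(void)
	{
		char names[8192], values[8192];
		long result = -1;
		FILE *f = fopen("/proc/net/netstat", "r");

		if (!f)
			return -1;
		/* TcpExt is a line of counter names followed by a line of values. */
		while (fgets(names, sizeof(names), f)) {
			if (strncmp(names, "TcpExt:", 7) != 0)
				continue;
			if (!fgets(values, sizeof(values), f))
				break;
			char *sn, *sv;
			char *name = strtok_r(names + 7, " \n", &sn);
			char *value = strtok_r(values + 7, " \n", &sv);
			while (name && value) {
				if (strcmp(name, "TCPTimeouts") == 0) {
					result = strtol(value, NULL, 10);
					break;
				}
				name = strtok_r(NULL, " \n", &sn);
				value = strtok_r(NULL, " \n", &sv);
			}
			break;
		}
		fclose(f);
		return result;
	}

	int main(void)
	{
		long prev = tcp_timeouts();

		while (prev >= 0) {
			long cur;

			sleep(1);
			cur = tcp_timeouts();
			if (cur < 0)
				break;
			/* The >40/s spikes described above show up as this delta. */
			printf("TCPTimeouts/s: %ld\n", cur - prev);
			prev = cur;
		}
		return 0;
	}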
>
> Thanks a lot for sharing this. I learned much from it!
>
> > At present, we're selectively applying this adjustment to clusters
> > that exclusively host the identified problematic applications, and
> > we're closely monitoring their performance to ensure stability. To
> > date, we've observed no network latency issues as a result of this
> > change. However, we remain cautious about extending this optimization
> > to other clusters, as the decision ultimately depends on a variety of
> > factors.
> >
> > It's important to note that we're not eager to implement this change
> > across our entire fleet, as we recognize the potential for unforeseen
> > consequences. Instead, we're taking a cautious approach by initially
> > applying it to a limited number of servers. This allows us to assess
> > its impact and make informed decisions about whether or not to expand
> > its use in the future.
>
> So, you haven't observed any performance regression yet. Right?

Right.

> If you haven't, I suggest keeping the patch in your downstream kernel
> for a while. In the future, if you find that the performance of some
> workloads suffers because of the new batch number, you can repost the
> patch with the supporting data. If, in the end, the performance of more
> and more workloads turns out to be good with the new batch number, you
> may consider making 0 the default value :-)

That is not how the real world works. In the real world:

- No one knows what may happen in the future. Therefore, if possible,
  we should make systems flexible, unless there is a strong
  justification for using a hard-coded value.

- Minimize changes whenever possible. These systems have been working
  fine, even if at lower performance. Why change them just for the sake
  of improving performance? Does the key metric in your performance
  data truly matter to their workloads?

--
Regards
Yafang