From: Yafang Shao <laoar.shao@gmail.com>
Date: Fri, 12 Jul 2024 14:41:17 +0800
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org,
    Matthew Wilcox, David Rientjes
In-Reply-To: <87cynj878z.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >>
> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Yafang Shao writes:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production
> >> >> >> >> >> >> >> > environment, particularly when monitoring latency spikes caused by
> >> >> >> >> >> >> >> > contention on the zone->lock. To address this, a new sysctl parameter
> >> >> >> >> >> >> >> > vm.pcp_batch_scale_max is introduced as a more practical alternative.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that a kernel
> >> >> >> >> >> >> >> configuration isn't as flexible as a sysctl knob. But a sysctl knob is
> >> >> >> >> >> >> >> ABI too.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several
> >> >> >> >> >> >> >> > suggestions have been proposed. One approach involves dividing large
> >> >> >> >> >> >> >> > zones into multiple smaller zones, as suggested by Matthew[0], while
> >> >> >> >> >> >> >> > another entails splitting the zone->lock using a mechanism similar to
> >> >> >> >> >> >> >> > memory arenas and shifting away from relying solely on zone_id to
> >> >> >> >> >> >> >> > identify the range of free lists a particular page belongs to[1].
> >> >> >> >> >> >> >> > However, implementing these solutions is likely to necessitate a more
> >> >> >> >> >> >> >> > extended development effort.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency.
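To make the tradeoff above concrete for readers of the thread: the knob
only bounds how many pages are moved per zone->lock hold, so lowering it
shortens each lock hold (smaller latency spikes) at the cost of taking
the lock more often. A rough userspace illustration of the arithmetic,
assuming a typical pcp->batch of 63 -- the kernel derives the real value
from the zone size, so treat that number as an example only:

    #include <stdio.h>

    int main(void)
    {
            /* Assumed typical per-CPU batch size; illustrative only. */
            int batch = 63;

            /* The scale factor caps the pages moved under one
             * zone->lock acquisition at roughly batch << scale_max. */
            for (int scale_max = 0; scale_max <= 6; scale_max++)
                    printf("pcp_batch_scale_max=%d -> up to %d pages per zone->lock hold\n",
                           scale_max, batch << scale_max);
            return 0;
    }

With the default of 5 that is up to 2016 pages moved under a single lock
hold; with 0 it drops to 63, which is the latency behavior being tuned
for here.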
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
> >> >> >> >> >> >> > been resistant to suggestions for modifying it into a more
> >> >> >> >> >> >> > user-friendly and practical tuning approach. May I inquire about the
> >> >> >> >> >> >> > rationale behind introducing this configuration in the beginning?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what
> >> >> >> >> >> >> "neutral" means?
> >> >> >> >> >> >
> >> >> >> >> >> > No, thanks.
> >> >> >> >> >> > After consulting ChatGPT, I received a clear and comprehensive
> >> >> >> >> >> > explanation of what "neutral" means, giving me a better understanding
> >> >> >> >> >> > of the concept.
> >> >> >> >> >> >
> >> >> >> >> >> > So, can you explain why you introduced it as a config in the
> >> >> >> >> >> > beginning?
> >> >> >> >> >>
> >> >> >> >> >> I think that I have explained it in the commit log of commit
> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too
> >> >> >> >> >> long latency"), which introduces the config.
> >> >> >> >> >
> >> >> >> >> > What specifically are your expectations for how users should utilize
> >> >> >> >> > this config in a real production workload?
> >> >> >> >> >
> >> >> >> >> >> A sysctl knob is ABI, which needs to be maintained forever. Can you
> >> >> >> >> >> explain why you need it? Why can't you use a fixed value after the
> >> >> >> >> >> initial experiments?
> >> >> >> >> >
> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
> >> >> >> >> > instance where the default value falls short? Surely, there must be
> >> >> >> >> > more practical and efficient approaches we can explore together to
> >> >> >> >> > ensure optimal performance across all workloads.
> >> >> >> >> >
> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
> >> >> >> >> > consider the needs and requirements of our actual users, along with
> >> >> >> >> > the diverse workloads they encounter in their daily operations.
> >> >> >> >>
> >> >> >> >> Have you found that your different systems require different
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX values already?
> >> >> >> >
> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
> >> >> >> > For other workloads, we keep it unchanged until we determine that the
> >> >> >> > default value is also suboptimal. What is the issue with this
> >> >> >> > approach?
> >> >> >>
> >> >> >> Firstly, this is a system-wide configuration, not a workload-specific
> >> >> >> one. So other workloads run on the same system will be impacted too.
> >> >> >> Will you run only one workload on one system?
> >> >> >
> >> >> > It seems we're living on different planets. You're happily working in
> >> >> > your lab environment, while I'm struggling with real-world production
> >> >> > issues.
> >> >> >
> >> >> > For servers:
> >> >> >
> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> > Server 1,000,001 and beyond: happy with all values
> >> >> >
> >> >> > Is this hard to understand?
> >> >> >
> >> >> > In other words, for applications:
> >> >> >
> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> >> >> > Application 1,000,001 and beyond: happy with all values
> >> >>
> >> >> Good to know this. Thanks!
> >> >>
> >> >> >> Secondly, we need some evidence to introduce a new system ABI. For
> >> >> >> example, we need to use a different configuration on different
> >> >> >> systems, otherwise some workloads will be hurt. Can you provide some
> >> >> >> evidence to support your change? IMHO, it's not good enough to say "I
> >> >> >> don't know why, I just don't want to change existing systems." If so,
> >> >> >> it may be better to wait until you have more evidence.
> >> >> >
> >> >> > It seems the community encourages developers to experiment with their
> >> >> > improvements in lab environments using meticulously designed test
> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
> >> >> > obtaining perfect data. However, it discourages developers from
> >> >> > directly addressing real-world workloads. Sigh.
> >> >>
> >> >> You cannot know whether your workloads benefit or are hurt by the
> >> >> different batch number, and how, in your production environment? If you
> >> >> cannot, how do you decide which workload deploys on which system (with
> >> >> a different batch number configuration)? If you can, can you provide
> >> >> such information to support your patch?
> >> >
> >> > We leverage a meticulous selection of network metrics, particularly
> >> > focusing on TcpExt indicators, to keep a close eye on application
> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
> >> >
> >> > In instances where a problematic container terminates, we've noticed a
> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
> >> > second, which serves as a clear indication that other applications are
> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
> >> > parameter to 0, we've been able to drastically reduce the maximum
> >> > frequency of these timeouts to less than one per second.
> >>
> >> Thanks a lot for sharing this. I learned much from it!
> >>
> >> > At present, we're selectively applying this adjustment to clusters
> >> > that exclusively host the identified problematic applications, and
> >> > we're closely monitoring their performance to ensure stability. To
> >> > date, we've observed no network latency issues as a result of this
> >> > change. However, we remain cautious about extending this optimization
> >> > to other clusters, as the decision ultimately depends on a variety of
> >> > factors.
> >> >
> >> > It's important to note that we're not eager to implement this change
> >> > across our entire fleet, as we recognize the potential for unforeseen
> >> > consequences. Instead, we're taking a cautious approach by initially
> >> > applying it to a limited number of servers. This allows us to assess
> >> > its impact and make informed decisions about whether or not to expand
> >> > its use in the future.
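For anyone who wants to reproduce the monitoring side: those TcpExt
counters come straight from /proc/net/netstat, which stores each group
as adjacent line pairs -- a "TcpExt:" header line of field names followed
by a "TcpExt:" line of values. A minimal probe sketch that pairs them up
(our real tooling samples per-second deltas rather than printing the raw
counter):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            char names[4096], values[4096];
            FILE *f = fopen("/proc/net/netstat", "r");

            if (!f) {
                    perror("/proc/net/netstat");
                    return 1;
            }
            /* Header and value lines come in adjacent pairs, so read
             * two lines at a time and walk them in lockstep. */
            while (fgets(names, sizeof(names), f) &&
                   fgets(values, sizeof(values), f)) {
                    if (strncmp(names, "TcpExt:", 7) != 0)
                            continue;
                    char *ns, *vs;
                    char *n = strtok_r(names + 7, " \n", &ns);
                    char *v = strtok_r(values + 7, " \n", &vs);
                    while (n && v) {
                            if (strcmp(n, "TCPTimeouts") == 0)
                                    printf("TcpExt.TCPTimeouts = %llu\n",
                                           strtoull(v, NULL, 10));
                            n = strtok_r(NULL, " \n", &ns);
                            v = strtok_r(NULL, " \n", &vs);
                    }
            }
            fclose(f);
            return 0;
    }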
> >>
> >> So, you haven't observed any performance hurt yet. Right?
> >
> > Right.
> >
> >> If you haven't, I suggest you keep the patch in your downstream kernel
> >> for a while. In the future, if you find that the performance of some
> >> workloads hurts because of the new batch number, you can repost the
> >> patch with the supporting data. If, in the end, the performance of more
> >> and more workloads is good with the new batch number, you may consider
> >> making 0 the default value :-)
> >
> > That is not how the real world works.
> >
> > In the real world:
> >
> > - No one knows what may happen in the future.
> >   Therefore, if possible, we should make systems flexible, unless
> >   there is a strong justification for using a hard-coded value.
> >
> > - Minimize changes whenever possible.
> >   These systems have been working fine in the past, even if with lower
> >   performance. Why make changes just for the sake of improving
> >   performance? Does the key metric of your performance data truly
> >   matter for their workload?
>
> These are good policies in your organization and business. But they are
> not necessarily the policies that the upstream Linux kernel should take.

You mean the upstream Linux kernel is only designed for the lab?

> The community needs to consider long-term maintenance overhead, so it
> adds new ABI (such as a sysfs knob) to the kernel only with the
> necessary justification. In general, it prefers a good default value or
> an automatic algorithm that works for everyone. The community tries to
> avoid (or fix) regressions as much as possible, but this will not stop
> the kernel from changing, even if the change is big.

Please explain to me why the kernel config is not ABI, but the sysctl is ABI.

> IIUC, because of the different requirements, there are upstream and
> downstream kernels.

Downstream developers backport features from the upstream kernel, and if
they find issues in the upstream kernel, they should contribute the fixes
back. That is how the Linux community works, right?

-- 
Regards
Yafang