From: "Huang, Ying"
To: Yafang Shao
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox, David Rientjes
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
In-Reply-To: (Yafang Shao's message of "Fri, 12 Jul 2024 10:25:16 +0800")
References: <20240707094956.94654-1-laoar.shao@gmail.com>
 <20240707094956.94654-4-laoar.shao@gmail.com>
 <878qyaarm6.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o774a0pv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87frsg9waa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <877cds9pa2.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1678l0f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 12 Jul 2024 11:05:17 +0800
Message-ID: <87plrj8g42.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Yafang Shao writes:
>> >> >> >>
>> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao writes:
>> >> >> >> >>
>> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> >> >> >> >> > is introduced as a more practical alternative.
>> >> >> >> >>
>> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> >> >> >> >> too.
>> >> >> >> >>
>> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> >> >> >> >> > likely to necessitate a more extended development effort.
>> >> >> >> >>
>> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> >> >> >> >
>> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> >> >> >> > configuration that has proven to be difficult to use, and you have
>> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> >> >> >> > behind introducing this configuration in the beginning?
>> >> >> >>
>> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> >> >> >> "neutral"?
>> >> >> >
>> >> >> > No, thanks.
>> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> >> >> > explanation of what "neutral" means, providing me with a better
>> >> >> > understanding of the concept.
>> >> >> >
>> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> >> >>
>> >> >> I think that I have explained it in the commit log of commit
>> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> >> >> latency").  Which introduces the config.
>> >> >
>> >> > What specifically are your expectations for how users should utilize
>> >> > this config in real production workload?
>> >> >
>> >> >>
>> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> >> >> experiments.
>> >> >
>> >> > Given the extensive scale of our production environment, with hundreds
>> >> > of thousands of servers, it begs the question: how do you propose we
>> >> > efficiently manage the various workloads that remain unaffected by the
>> >> > sysctl change implemented on just a few thousand servers? Is it
>> >> > feasible to expect us to recompile and release a new kernel for every
>> >> > instance where the default value falls short? Surely, there must be
>> >> > more practical and efficient approaches we can explore together to
>> >> > ensure optimal performance across all workloads.
>> >> >
>> >> > When making improvements or modifications, kindly ensure that they are
>> >> > not solely confined to a test or lab environment. It's vital to also
>> >> > consider the needs and requirements of our actual users, along with
>> >> > the diverse workloads they encounter in their daily operations.
>> >>
>> >> Have you found that your different systems requires different
>> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> >
>> > For specific workloads that introduce latency, we set the value to 0.
>> > For other workloads, we keep it unchanged until we determine that the
>> > default value is also suboptimal. What is the issue with this
>> > approach?
>>
>> Firstly, this is a system wide configuration, not workload specific.
>> So, other workloads run on the same system will be impacted too.  Will
>> you run one workload only on one system?
>
> It seems we're living on different planets. You're happily working in
> your lab environment, while I'm struggling with real-world production
> issues.
>
> For servers:
>
> Server 1 to 10,000: vm.pcp_batch_scale_max = 0
> Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> Server 1,000,001 and beyond: Happy with all values
>
> Is this hard to understand?
>
> In other words:
>
> For applications:
>
> Application 1 to 10,000: vm.pcp_batch_scale_max = 0
> Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
> Application 1,000,001 and beyond: Happy with all values

Good to know this.  Thanks!

>>
>> Secondly, we need some evidences to introduce a new system ABI.  For
>> example, we need to use different configuration on different systems
>> otherwise some workloads will be hurt.  Can you provide some evidences
>> to support your change?  IMHO, it's not good enough to say I don't know
>> why I just don't want to change existing systems.  If so, it may be
>> better to wait until you have more evidences.
>
> It seems the community encourages developers to experiment with their
> improvements in lab environments using meticulously designed test
> cases A, B, C, and as many others as they can imagine, ultimately
> obtaining perfect data. However, it discourages developers from
> directly addressing real-world workloads. Sigh.

Can you not tell whether, and how, your workloads benefit from or are
hurt by different batch numbers in your production environment?  If you
cannot, how do you decide which workload is deployed on which system
(with different batch number configurations)?  If you can, can you
provide that information to support your patch?

--
Best Regards,
Huang, Ying