From: "Huang, Ying"
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven,
    Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner,
    Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox,
    Christoph Lameter
Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning
References: <20230920061856.257597-1-ying.huang@intel.com>
    <20230920094118.8b8f739125c6aede17c627e0@linux-foundation.org>
Date: Thu, 21 Sep 2023 21:32:35 +0800
In-Reply-To: <20230920094118.8b8f739125c6aede17c627e0@linux-foundation.org>
    (Andrew Morton's message of "Wed, 20 Sep 2023 09:41:18 -0700")
Message-ID: <87leczwt1o.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi, Andrew,

Andrew Morton writes:

> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying wrote:
>
>> The page allocation performance requirements of different workloads
>> are often different.  So, we need to tune the PCP (Per-CPU Pageset)
>> high on each CPU automatically to optimize the page allocation
>> performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era).  But these numbers
> make me think it's very important and we should have been paying more
> attention.
>
>> The list of patches in series is as follows,
>>
>> 1 mm, pcp: avoid to drain PCP when process exit
>> 2 cacheinfo: calculate per-CPU data cache size
>> 3 mm, pcp: reduce lock contention for draining high-order pages
>> 4 mm: restrict the pcp batch scale factor to avoid too long latency
>> 5 mm, page_alloc: scale the number of pages that are batch allocated
>> 6 mm: add framework for PCP high auto-tuning
>> 7 mm: tune PCP high automatically
>> 8 mm, pcp: decrease PCP high if free pages < high watermark
>> 9 mm, pcp: avoid to reduce PCP high unnecessarily
>> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
>>
>> Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
>> freeing.
>>
>> Patch 4/5 optimize batch freeing and allocating.
>>
>> Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
>>
>> Patch 10 optimize the PCP draining for consecutive high order page
>> freeing based on PCP high auto-tuning.
>>
>> The test results for patches with performance impact are as follows,
>>
>> kbuild
>> ======
>>
>> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
>> one socket with `make -j 112`.
>>
>>          build time  zone lock%  free_high  alloc_zone
>>          ----------  ----------  ---------  ----------
>> base          100.0        43.6      100.0       100.0
>> patch1         96.6        40.3       49.2        95.2
>> patch3         96.4        40.5       11.3        95.1
>> patch5         96.1        37.9       13.3        96.8
>> patch7         86.4         9.8        6.2        22.0
>> patch9         85.9         9.4        4.8        16.3
>> patch10        87.7        12.6       29.0        32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Thank you very much for questioning this!  I double-checked my test
results and configuration and found that I had used an uncommon
configuration.  So the description of the test should have been,

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild with
`numactl -m 1 -- make -j 112`.

This makes processes running on socket 0 use the normal zone of
socket 1.  The remote accesses to zone->lock cause heavy lock
contention.

I apologize for any confusion caused by the above test results.

If we test kbuild with `make -j 224` on the machine, the test results
become,

         build time       lock%  free_high  alloc_zone
         ----------  ----------  ---------  ----------
base          100.0        16.8      100.0       100.0
patch5         99.2        13.9        9.5        97.0
patch7         98.5         5.4        4.8        19.2

Although the lock contention cycles%, the PCP draining for high-order
freeing, and the allocations from the zone are all greatly reduced, the
build time barely changes.

We also tested kbuild in the following way: we created 8 cgroups and
ran `make -j 28` in each cgroup.  That is, the total parallelism is the
same, but LRU lock contention can be eliminated via cgroups.  Also, the
single-process link stage takes a smaller share of the time relative to
the parallel compile stage.  This isn't common for personal usage.
But it can be used by something like the 0Day kbuild service.  The test
results are as follows,

         build time       lock%  free_high  alloc_zone
         ----------  ----------  ---------  ----------
base          100.0        14.2      100.0       100.0
patch5         98.5         8.5        8.1        97.1
patch7         95.0         0.7        3.0        19.0

The lock contention cycles% drops to nearly 0, because LRU lock
contention is eliminated too.  The build time reduction becomes visible
as well.  We will continue to do a full test with this configuration.

> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
>   monitor the operation of the autotuning algorithm?

/proc/zoneinfo can be used to observe PCP high and count for each CPU.

> - I'm not seeing any Documentation/ updates.  Surely there are things
>   we can tell users?

I will think about that.

> - This:
>
>   : It's possible that PCP high auto-tuning doesn't work well for some
>   : workloads.  So, when PCP high is tuned by hand via the sysctl knob,
>   : the auto-tuning will be disabled.  The PCP high set by hand will be
>   : used instead.
>
>   Is it a bit hacky to disable autotuning when the user alters
>   pcp-high?  Would it be cleaner to have a separate on/off knob for
>   autotuning?

This was suggested by Mel Gorman,

https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/

"
I'm not opposed to having an adaptive pcp->high in concept. I think it would
be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
though. I expect that users of that tunable are rare and that if it *is*
used that there is a very good reason for it.
"

Do you think that this is reasonable?

> And how is the user to determine that "PCP high auto-tuning doesn't work
> well" for their workload?

One way is to check the perf profiling results.  If there is heavy zone
lock contention, then the PCP high auto-tuning isn't working well
enough to eliminate it, and users may try to tune PCP high by hand.

--
Best Regards,
Huang, Ying
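
As a rough illustration of the /proc/zoneinfo monitoring mentioned
above, here is a minimal Python sketch that extracts the per-CPU PCP
`count`, `high`, and `batch` values for each zone.  It assumes the usual
layout of the `pagesets` section (`cpu:` lines followed by `count:`,
`high:`, and `batch:` lines); the function name `parse_pcp` and the
numbers in `SAMPLE` are made up for the example and are not part of the
patch series.

```python
import re
from collections import defaultdict

# Made-up excerpt in the usual /proc/zoneinfo layout (assumption: the
# per-CPU fields appear as "count:", "high:" and "batch:" after "cpu: N").
SAMPLE = """\
Node 0, zone   Normal
  pages free     3456
        min      100
        low      200
        high     300
  pagesets
    cpu: 0
              count: 120
              high:  378
              batch: 63
  vm stats threshold: 72
    cpu: 1
              count: 15
              high:  378
              batch: 63
  vm stats threshold: 72
"""


def parse_pcp(zoneinfo_text):
    """Return {(node, zone): {cpu: {"count": n, "high": n, "batch": n}}}."""
    result = defaultdict(dict)
    zone_key = None  # (node id, zone name) of the zone being parsed
    cpu = None       # CPU id of the pageset being parsed
    for line in zoneinfo_text.splitlines():
        m = re.match(r"Node (\d+), zone\s+(\S+)", line)
        if m:
            zone_key = (int(m.group(1)), m.group(2))
            cpu = None
            continue
        m = re.match(r"\s+cpu:\s*(\d+)", line)
        if m and zone_key is not None:
            cpu = int(m.group(1))
            result[zone_key][cpu] = {}
            continue
        # The zone watermark line "high     300" has no colon, so it is
        # not confused with the per-CPU "high:" field.
        m = re.match(r"\s+(count|high|batch):\s*(\d+)", line)
        if m and cpu is not None:
            result[zone_key][cpu][m.group(1)] = int(m.group(2))
    return dict(result)


if __name__ == "__main__":
    try:
        with open("/proc/zoneinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    for (node, zone), cpus in parse_pcp(text).items():
        for cpu_id, pcp in sorted(cpus.items()):
            print(f"node {node} zone {zone} cpu {cpu_id}: "
                  f"count={pcp.get('count')} high={pcp.get('high')}")
```

Running this periodically during a workload shows how `high` moves on
each CPU, which is one simple way to watch what the auto-tuning does.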