From: "Huang, Ying" <ying.huang@intel.com>
To: Mel Gorman
Cc: Michal Hocko, linux-mm@kvack.org, Arjan Van De Ven, Andrew Morton,
        Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
        Pavel Tatashin, Matthew Wilcox
Subject: Re: [RFC 2/2] mm: alloc/free depth based PCP high auto-tuning
References: <20230710065325.290366-1-ying.huang@intel.com>
        <20230710065325.290366-3-ying.huang@intel.com>
        <20230712090526.thk2l7sbdcdsllfi@techsingularity.net>
        <871qhcdwa1.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <20230714140710.5xbesq6xguhcbyvi@techsingularity.net>
        <87pm4qdhk4.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <20230717135017.7ro76lsaninbazvf@techsingularity.net>
        <87lefeca2z.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <20230718123428.jcy4avtjg3rhuh7i@techsingularity.net>
Date: Wed, 19 Jul 2023 13:59:00 +0800
In-Reply-To: <20230718123428.jcy4avtjg3rhuh7i@techsingularity.net> (Mel
        Gorman's message of "Tue, 18 Jul 2023 13:34:28 +0100")
Message-ID: <87mszsbfx7.fsf@yhuang6-desk2.ccr.corp.intel.com>

Mel Gorman writes:

> On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote:
>> Mel Gorman writes:
>>
>> > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote:
>> >> Mel Gorman writes:
>> >>
>> >> > Batch should have a much lower maximum than high because it's a
>> >> > deferred cost that gets assigned to an arbitrary task. The worst
>> >> > case is where a process that is a light user of the allocator
>> >> > incurs the full cost of a refill/drain.
>> >> >
>> >> > Again, intuitively this may be a PID-control problem for the "Mix"
>> >> > case: estimate the size of high required to minimise drains/allocs,
>> >> > as each drain/alloc is potentially a lock contention. The catchall
>> >> > for corner cases would be to decay high from vmstat context based
>> >> > on pcp->expires. The decay would prevent "high" being pinned at an
>> >> > artificially high value without any zone lock contention for
>> >> > prolonged periods of time, and would also mitigate the worst case
>> >> > due to the state being per-CPU. The downside is that "high" would
>> >> > also oscillate for a continuous, steady allocation pattern: the PID
>> >> > control might pick an ideal value suitable for a long period of
>> >> > time, with the "decay" disrupting that ideal value.
>> >>
>> >> Maybe we can track the minimal value of pcp->count. If it has been
>> >> small enough recently, we can avoid decaying pcp->high, because the
>> >> pages in the PCP are being used for allocations rather than sitting
>> >> idle.
>> >
>> > Implement as a separate patch. I suspect this type of heuristic will
>> > be very benchmark specific and the complexity may not be worth it in
>> > the general case.
>>
>> OK.
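
To make the heuristic above concrete, a minimal sketch of "track the
minimum of pcp->count and skip the decay while it stays small". The
names (pcp_sketch, count_min, pcp_note_count, pcp_decay_high,
idle_threshold) and the 1/8 decay step are illustrative assumptions,
not code from this patchset:

/*
 * Stand-in for the kernel's per-CPU page list state; only the fields
 * this sketch needs. "count_min" would be a new field.
 */
struct pcp_sketch {
        int count;        /* pages currently held on the PCP list */
        int high;         /* auto-tuned high limit for the list */
        int high_min;     /* floor below which decay must not go */
        int count_min;    /* minimum of "count" since the last decay tick */
};

/* Allocation-path hook: remember how empty the list became. */
static void pcp_note_count(struct pcp_sketch *pcp)
{
        if (pcp->count < pcp->count_min)
                pcp->count_min = pcp->count;
}

/*
 * Periodic decay, e.g. run from vmstat context. If the list nearly
 * emptied since the last tick, the cached pages were feeding real
 * allocations, so leave "high" alone; otherwise decay it so it cannot
 * stay pinned at an artificially high value.
 */
static void pcp_decay_high(struct pcp_sketch *pcp, int idle_threshold)
{
        if (pcp->count_min > idle_threshold) {
                int next = pcp->high - (pcp->high >> 3);  /* ~12.5% decay */

                pcp->high = next > pcp->high_min ? next : pcp->high_min;
        }
        /* Restart minimum tracking for the next interval. */
        pcp->count_min = pcp->count;
}

Resetting count_min at every tick means "recently" is defined by the
decay period itself, so no extra timestamps are needed.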

>> >> Another question is as follows.
>> >>
>> >> For example, on CPU A, a large number of pages are freed, and we
>> >> maximize batch and high, so a large number of pages are put in the
>> >> PCP. Then, the possible situations may be:
>> >>
>> >> a) a large number of pages are allocated on CPU A after some time
>> >> b) a large number of pages are allocated on another CPU B
>> >>
>> >> For a), we want the pages to be kept in the PCP of CPU A as long as
>> >> possible. For b), we want the pages to be kept in the PCP of CPU A
>> >> for as short a time as possible. I think that we need to balance
>> >> between them. What is a reasonable time to keep pages in the PCP
>> >> without many allocations?
>> >>
>> >
>> > This would be a case where you're relying on vmstat to drain the PCP
>> > after a period of time as it is a corner case.
>>
>> Yes. The remaining question is: how long should "a period of time" be?
>
> Match the time used for draining "remote" pages from the PCP lists. The
> choice is arbitrary and no matter what value is chosen, it'll be
> possible to build an adverse workload.

OK.

>> If it's long, the pages in the PCP can be used for allocation after
>> some time. If it's short, the pages can be put back in the buddy
>> allocator, so they can be used by other workloads if needed.
>>
>
> Assume that the main reason to expire pages and put them back on the
> buddy list is to avoid premature allocation failures due to pages
> pinned on the PCP. Once pages are going back onto the buddy list and
> the expiry is hit, it might as well be assumed that the pages are
> cache-cold. Some bad corner cases should be mitigated by disabling the
> adaptive sizing when reclaim is active.

Yes. This can be mitigated, but the page allocation performance may be
hurt.

> The big remaining corner case to watch out for is where the sum of the
> boosted pcp->high exceeds the low watermark. If that should ever happen
> then potentially a premature OOM happens, because the watermarks are
> fine so no reclaim is active, but no pages are available. It may even
> be the case that the sum of pcp->high should not exceed *min*, as that
> corner case means that processes may prematurely enter direct reclaim
> (not as bad as OOM but still bad).

Sorry, I don't understand this. When pages are moved from the buddy
lists to the PCP, the zone's NR_FREE_PAGES is decreased in
rmqueue_bulk(). That is, pages in the PCP are counted as used instead
of free. And in zone_watermark_ok*() and zone_watermark_fast(), the
zone's NR_FREE_PAGES is used to check the watermark. So, if my
understanding is correct, even if the number of pages in the PCP is
larger than the low/min watermark, we can still trigger reclaim. Is my
understanding correct?
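
For reference, a toy model of the accounting described above.
rmqueue_bulk() and zone_watermark_ok*() are the real kernel functions
under discussion; the structure and function names below are
illustrative stand-ins, not the kernel's:

/* Models the zone free-page accounting only. */
struct zone_sketch {
        long nr_free;     /* models the zone NR_FREE_PAGES vmstat item */
        long low_wmark;   /* models the zone low watermark */
};

/*
 * Refilling a PCP moves @batch pages out of the buddy lists, as
 * rmqueue_bulk() does, so they immediately stop counting as free.
 */
static void pcp_refill(struct zone_sketch *z, long batch)
{
        z->nr_free -= batch;
}

/*
 * Simplified check in the spirit of zone_watermark_ok(): only pages
 * still on the buddy lists count as free.
 */
static int watermark_ok(const struct zone_sketch *z)
{
        return z->nr_free > z->low_wmark;
}

If that reading of the accounting is right, boosted PCPs push the free
counter below the watermark as they fill, so reclaim is woken rather
than allocations failing silently; that is what the question above is
probing.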

>> Anyway, I will do some experiments for that.
>>
>> > You cannot reasonably detect the pattern on two separate per-cpu
>> > lists without either inspecting remote CPU state or maintaining
>> > global state. Either would incur cache miss penalties that probably
>> > cost more than the heuristic saves.
>>
>> Yes. Totally agree.

Best Regards,
Huang, Ying