From: "Huang, Ying" <ying.huang@intel.com>
To: Mel Gorman
Cc: Michal Hocko, linux-mm@kvack.org, Arjan Van De Ven, Andrew Morton,
        Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
        Pavel Tatashin, Matthew Wilcox
Subject: Re: [RFC 2/2] mm: alloc/free depth based PCP high auto-tuning
References: <20230710065325.290366-1-ying.huang@intel.com>
        <20230710065325.290366-3-ying.huang@intel.com>
        <20230712090526.thk2l7sbdcdsllfi@techsingularity.net>
        <871qhcdwa1.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <20230714140710.5xbesq6xguhcbyvi@techsingularity.net>
        <87pm4qdhk4.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <20230717135017.7ro76lsaninbazvf@techsingularity.net>
        <87lefeca2z.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <20230718123428.jcy4avtjg3rhuh7i@techsingularity.net>
Date: Wed, 19 Jul 2023 13:59:00 +0800
In-Reply-To: <20230718123428.jcy4avtjg3rhuh7i@techsingularity.net> (Mel
        Gorman's message of "Tue, 18 Jul 2023 13:34:28 +0100")
Message-ID: <87mszsbfx7.fsf@yhuang6-desk2.ccr.corp.intel.com>

Mel Gorman writes:

> On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote:
>> Mel Gorman writes:
>>
>> > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote:
>> >> Mel Gorman writes:
>> >>
>> >> > Batch should have a much lower maximum than high because it's a
>> >> > deferred cost that gets assigned to an arbitrary task. The worst
>> >> > case is where a process that is a light user of the allocator
>> >> > incurs the full cost of a refill/drain.
>> >> >
>> >> > Again, intuitively this may be a PID-control problem for the "Mix"
>> >> > case: estimate the size of high required to minimise drains/allocs,
>> >> > as each drain/alloc is potentially a lock contention. The catchall
>> >> > for corner cases would be to decay high from vmstat context based
>> >> > on pcp->expires. The decay would prevent "high" being pinned at an
>> >> > artificially high value without any zone lock contention for
>> >> > prolonged periods of time, and would also mitigate the worst case
>> >> > due to the state being per-CPU. The downside is that "high" would
>> >> > also oscillate for a continuous, steady allocation pattern: the PID
>> >> > control might pick an ideal value suitable for a long period of
>> >> > time, with the "decay" disrupting that ideal value.
>> >>
>> >> Maybe we can track the minimal value of pcp->count. If it has been
>> >> small enough recently, we can avoid decaying pcp->high, because the
>> >> pages in the PCP are being used for allocations rather than sitting
>> >> idle.
>> >
>> > Implement as a separate patch. I suspect this type of heuristic will
>> > be very benchmark specific and the complexity may not be worth it in
>> > the general case.
>>
>> OK.
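
To make the heuristic above concrete, a minimal sketch of "track the
minimum of pcp->count and skip the decay while it stays small". The
names (pcp_sketch, count_min, pcp_note_count, pcp_decay_high,
idle_threshold) and the 1/8 decay step are illustrative assumptions,
not code from this patchset:

/*
 * Stand-in for the kernel's per-CPU page list state; only the fields
 * this sketch needs. "count_min" would be a new field.
 */
struct pcp_sketch {
        int count;        /* pages currently held on the PCP list */
        int high;         /* auto-tuned high limit for the list */
        int high_min;     /* floor below which decay must not go */
        int count_min;    /* minimum of "count" since the last decay tick */
};

/* Allocation-path hook: remember how empty the list became. */
static void pcp_note_count(struct pcp_sketch *pcp)
{
        if (pcp->count < pcp->count_min)
                pcp->count_min = pcp->count;
}

/*
 * Periodic decay, e.g. run from vmstat context. If the list nearly
 * emptied since the last tick, the cached pages were feeding real
 * allocations, so leave "high" alone; otherwise decay it so it cannot
 * stay pinned at an artificially high value.
 */
static void pcp_decay_high(struct pcp_sketch *pcp, int idle_threshold)
{
        if (pcp->count_min > idle_threshold) {
                int next = pcp->high - (pcp->high >> 3);  /* ~12.5% decay */

                pcp->high = next > pcp->high_min ? next : pcp->high_min;
        }
        /* Restart minimum tracking for the next interval. */
        pcp->count_min = pcp->count;
}

Resetting count_min at every tick means "recently" is defined by the
decay period itself, so no extra timestamps are needed.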

>> >> Another question is as follows.
>> >>
>> >> For example, on CPU A, a large number of pages are freed, and we
>> >> maximize batch and high, so a large number of pages are put in the
>> >> PCP. Then, the possible situations may be:
>> >>
>> >> a) a large number of pages are allocated on CPU A after some time
>> >> b) a large number of pages are allocated on another CPU B
>> >>
>> >> For a), we want the pages to be kept in the PCP of CPU A as long as
>> >> possible. For b), we want the pages to be kept in the PCP of CPU A
>> >> for as short a time as possible. I think that we need to balance
>> >> between them. What is a reasonable time to keep pages in the PCP
>> >> without many allocations?
>> >>
>> >
>> > This would be a case where you're relying on vmstat to drain the PCP
>> > after a period of time as it is a corner case.
>>
>> Yes. The remaining question is: how long should "a period of time" be?
>
> Match the time used for draining "remote" pages from the PCP lists. The
> choice is arbitrary and no matter what value is chosen, it'll be
> possible to build an adverse workload.

OK.

>> If it's long, the pages in the PCP can be used for allocation after
>> some time. If it's short, the pages can be put back in the buddy
>> allocator, so they can be used by other workloads if needed.
>>
>
> Assume that the main reason to expire pages and put them back on the
> buddy list is to avoid premature allocation failures due to pages
> pinned on the PCP. Once pages are going back onto the buddy list and
> the expiry is hit, it might as well be assumed that the pages are
> cache-cold. Some bad corner cases should be mitigated by disabling the
> adaptive sizing when reclaim is active.

Yes. This can be mitigated, but the page allocation performance may be
hurt.

> The big remaining corner case to watch out for is where the sum of the
> boosted pcp->high exceeds the low watermark. If that should ever happen
> then potentially a premature OOM happens, because the watermarks are
> fine so no reclaim is active, but no pages are available. It may even
> be the case that the sum of pcp->high should not exceed *min*, as that
> corner case means that processes may prematurely enter direct reclaim
> (not as bad as OOM but still bad).

Sorry, I don't understand this. When pages are moved from the buddy
lists to the PCP, the zone's NR_FREE_PAGES is decreased in
rmqueue_bulk(). That is, pages in the PCP are counted as used instead
of free. And in zone_watermark_ok*() and zone_watermark_fast(), the
zone's NR_FREE_PAGES is used to check the watermark. So, if my
understanding is correct, even if the number of pages in the PCP is
larger than the low/min watermark, we can still trigger reclaim. Is my
understanding correct?
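
For reference, a toy model of the accounting described above.
rmqueue_bulk() and zone_watermark_ok*() are the real kernel functions
under discussion; the structure and function names below are
illustrative stand-ins, not the kernel's:

/* Models the zone free-page accounting only. */
struct zone_sketch {
        long nr_free;     /* models the zone NR_FREE_PAGES vmstat item */
        long low_wmark;   /* models the zone low watermark */
};

/*
 * Refilling a PCP moves @batch pages out of the buddy lists, as
 * rmqueue_bulk() does, so they immediately stop counting as free.
 */
static void pcp_refill(struct zone_sketch *z, long batch)
{
        z->nr_free -= batch;
}

/*
 * Simplified check in the spirit of zone_watermark_ok(): only pages
 * still on the buddy lists count as free.
 */
static int watermark_ok(const struct zone_sketch *z)
{
        return z->nr_free > z->low_wmark;
}

If that reading of the accounting is right, boosted PCPs push the free
counter below the watermark as they fill, so reclaim is woken rather
than allocations failing silently; that is what the question above is
probing.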

>> Anyway, I will do some experiments for that.
>>
>> > You cannot reasonably detect the pattern on two separate per-cpu
>> > lists without either inspecting remote CPU state or maintaining
>> > global state. Either would incur cache miss penalties that probably
>> > cost more than the heuristic saves.
>>
>> Yes. Totally agree.

Best Regards,
Huang, Ying