From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D3D98C77B7F
	for <linux-mm@archiver.kernel.org>; Tue, 16 May 2023 09:39:27 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id F3935280005; Tue, 16 May 2023 05:39:26 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id EE9E3280004; Tue, 16 May 2023 05:39:26 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id DB072280005; Tue, 16 May 2023 05:39:26 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id CB212280004
	for <linux-mm@kvack.org>; Tue, 16 May 2023 05:39:26 -0400 (EDT)
Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay09.hostedemail.com (Postfix) with ESMTP id 6E6FE801B3
	for <linux-mm@kvack.org>; Tue, 16 May 2023 09:39:26 +0000 (UTC)
X-FDA: 80795620332.04.2174365
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
	by imf24.hostedemail.com (Postfix) with ESMTP id 225F3180009
	for <linux-mm@kvack.org>; Tue, 16 May 2023 09:39:22 +0000 (UTC)
Authentication-Results: imf24.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=SLYHwmjM;
	spf=pass (imf24.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1684229964;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=GswTU5oXGtMFeVF8rtFUdgfhwUQY3+wd0Ga1x9ba4ac=;
	b=T3qusPXfErj2EIpm+Gx8jAQ0mVpbILgXKzYG0ngBF8uW8XjU6bA5gsVNWhu6SgXdO4adTq
	pMO5qrBXRoBbempDzz/6n9Tu0xeLtnH/XQeKjkCNCx0zLwfzMzjXSo01ryVdpxpEh8/Cxo
	3u6sE8lVSdHhQYA12Fc6vaHF3yyrVoU=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1684229964; a=rsa-sha256;
	cv=none;
	b=kqqX0DSQ7/kUHVLszYSHux6Q7w+e0iWlvpOwW1TEuqRuTPf6XCuwL3JgZnMzuIVyX0gEeV
	nEopniziZQ3LFhxPB7oJQuBUkdy4X10lLPNhBR+nbochFtGsvtRbFIGCUrRshVGkhSW0IY
	Fv0kVjm0Z8OhkV80ZXlHuZkVCBNlTEY=
ARC-Authentication-Results: i=1;
	imf24.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=SLYHwmjM;
	spf=pass (imf24.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1684229963; x=1715765963;
  h=from:to:cc:subject:references:date:in-reply-to:
   message-id:mime-version;
  bh=a+1CyWspy7I4EDdP6KfbpZ/3Tq3hn9rZKAuCfoYYGB4=;
  b=SLYHwmjMsO0sGRGO9fxN+RyFEu5f8nFZlAtGjggoXFQckEk3UzqxY19q
   ye50oiin32Bt/YvK1WgGp3CF5a9q0bJc1hgzGXlpu5XorFPp9CJR/YG9O
   xQslZjwUfTNxCv+uQaunKQV4VRkTRtEnOnHIn06uchpjsOGi4S5lNI6f7
   hDZrqFDDnKx9e+Ohvqlr4Il2U75qpY2IOg6xtm8z2MAo2NmKES0ltAPR/
   HRAbaLN3yxdmhepCFcrIBRliJWaORdw6mrxIPpnZrmP1qmTGjYiLuX7V3
   WJAUsnLbYjklRjY5g0GSQRxdptyHkjFwA4WWHJIwOBgf6XDbtjHOjwt58
   w==;
X-IronPort-AV: E=McAfee;i="6600,9927,10711"; a="354598000"
X-IronPort-AV: E=Sophos;i="5.99,278,1677571200"; 
   d="scan'208";a="354598000"
Received: from orsmga008.jf.intel.com ([10.7.209.65])
  by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 May 2023 02:39:21 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10711"; a="731928090"
X-IronPort-AV: E=Sophos;i="5.99,278,1677571200"; 
   d="scan'208";a="731928090"
Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55])
  by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 May 2023 02:39:18 -0700
From: "Huang, Ying" <ying.huang@intel.com>
To: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org,  linux-kernel@vger.kernel.org,  Arjan Van De Ven
 <arjan@linux.intel.com>,  Andrew Morton <akpm@linux-foundation.org>,  Mel
 Gorman <mgorman@techsingularity.net>,  Vlastimil Babka <vbabka@suse.cz>,
  David Hildenbrand <david@redhat.com>,  Johannes Weiner
 <jweiner@redhat.com>,  Dave Hansen <dave.hansen@linux.intel.com>,  Pavel
 Tatashin <pasha.tatashin@soleen.com>,  Matthew Wilcox
 <willy@infradead.org>
Subject: Re: [RFC 0/6] mm: improve page allocator scalability via splitting
 zones
References: <20230511065607.37407-1-ying.huang@intel.com>
	<ZF0ET82ajDbFrIw/@dhcp22.suse.cz>
	<87r0rm8die.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZGIUEqhSydAdvRFN@dhcp22.suse.cz>
Date: Tue, 16 May 2023 17:38:06 +0800
In-Reply-To: <ZGIUEqhSydAdvRFN@dhcp22.suse.cz> (Michal Hocko's message of
	"Mon, 15 May 2023 13:14:26 +0200")
Message-ID: <87jzx87h1d.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
X-Rspamd-Queue-Id: 225F3180009
X-Stat-Signature: 8dbx7r9x63qg3xgj6xb67dn7gu1xszsi
X-Rspam-User: 
X-Rspamd-Server: rspam09
X-HE-Tag: 1684229962-52265
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Michal Hocko <mhocko@suse.com> writes:

> On Fri 12-05-23 10:55:21, Huang, Ying wrote:
>> Hi, Michal,
>> 
>> Thanks for comments!
>> 
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 11-05-23 14:56:01, Huang Ying wrote:
>> >> The patchset is based on upstream v6.3.
>> >> 
>> >> More and more cores are put in one physical CPU (usually one NUMA node
>> >> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
>> >> Even more cores per physical CPU are planned for future CPUs.  While
>> >> all cores in one physical CPU will contend for the page allocation on
>> >> one zone in most cases.  This causes heavy zone lock contention in
>> >> some workloads.  And the situation will become worse and worse in the
>> >> future.
>> >> 
>> >> For example, on an 2-socket Intel server machine with 224 logical
>> >> CPUs, if the kernel is built with `make -j224`, the zone lock
>> >> contention cycles% can reach up to about 12.7%.
>> >> 
>> >> To improve the scalability of the page allocation, in this series, we
>> >> will create one zone instance for each about 256 GB memory of a zone
>> >> type generally.  That is, one large zone type will be split into
>> >> multiple zone instances.  Then, different logical CPUs will prefer
>> >> different zone instances based on the logical CPU No.  So the total
>> >> number of logical CPUs contend on one zone will be reduced.  Thus the
>> >> scalability is improved.
>> >
>> > It is not really clear to me why you need a new zone for all this rather
>> > than partition free lists internally within the zone? Essentially to
>> > increase the current two level system to 3: per cpu caches, per cpu
>> > arenas and global fallback.
>> 
>> Sorry, I didn't get your idea here.  What is per cpu arenas?  What's the
>> difference between it and per cpu caches (PCP)?
>
> Sorry, I didn't give this much thought than the above. Essentially, we
> have 2 level system right now. Pcp caches should reduce the contention
> on the per cpu level and that should work reasonably well, if you manage
> to align batch sizes to the workload AFAIK. If this is not sufficient
> then why to add the full zone rather than to add another level that
> caches across a larger than a cpu unit. Maybe a core?
>
> This might be a wrong way around going for this but there is not much
> performance analysis about the source of the lock contention so I am
> mostly guessing.

I guess that page allocation scalability will improve if we put more
pages in the per-CPU caches, or add another level of cache shared by
multiple logical CPUs, because more page allocation requests can then
be satisfied without acquiring the zone lock.

As with any caching system, there are always cases where the caches are
drained and too many requests go to the underlying slow layer (the zone
here).  For example, if a workload needs to allocate a huge number of
pages (more than the cache size) in parallel, it will eventually run
into zone lock contention.  The situation will become worse and worse as
more and more logical CPUs share one zone, which is the trend in the
industry now.  Per my understanding, that is why we observe the high
zone lock contention cycles in the kbuild test.
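
To make the argument above concrete, here is a toy model (not taken
from the series, and not a measurement).  It assumes every CPU starts
with an empty per-CPU cache and takes the zone lock once per refill of
`pcp_batch` pages.  The function name and parameters are hypothetical.

```python
import math


def zone_lock_acquisitions(nr_cpus, pages_per_cpu, pcp_batch):
    """Toy count of zone lock acquisitions during a parallel allocation burst.

    Assumption: each CPU refills its per-CPU cache `pcp_batch` pages at a
    time and takes the (single, shared) zone lock once per refill.
    """
    refills_per_cpu = math.ceil(pages_per_cpu / pcp_batch)
    return nr_cpus * refills_per_cpu


# Doubling the CPUs that share one zone doubles the acquisitions on that
# one lock; a larger batch only divides the count by a constant factor.
one_socket = zone_lock_acquisitions(nr_cpus=112, pages_per_cpu=100_000, pcp_batch=64)
two_sockets_one_zone = zone_lock_acquisitions(nr_cpus=224, pages_per_cpu=100_000, pcp_batch=64)
```

In this model a bigger cache reduces the number of acquisitions per
CPU, but the total on one zone lock still grows linearly with the
number of CPUs that share the zone.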

So, per my understanding, to improve the page allocation scalability in
bad situations (that is, when caching doesn't work well enough), we need
to restrict the number of logical CPUs that share one zone.  This series
is an attempt at that.  Better caching can make the good situations more
common and the bad situations rarer, but it seems hard to eliminate all
bad situations.
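
As a rough sketch of the idea (not the code in the series), the
following shows how splitting one zone type into instances of about
256 GB, as described in the cover letter, reduces the number of logical
CPUs that prefer each instance.  The rounding and the modulo mapping
from CPU number to instance are assumptions for illustration only; all
names are hypothetical.

```python
import math

ZONE_INSTANCE_GB = 256  # "about 256 GB" per zone instance, per the cover letter


def nr_zone_instances(zone_type_gb):
    """Number of instances one zone type is split into (rounding is assumed)."""
    return max(1, math.ceil(zone_type_gb / ZONE_INSTANCE_GB))


def preferred_zone_instance(cpu, nr_instances):
    """Assumed mapping: spread logical CPUs over instances by CPU number."""
    return cpu % nr_instances


def cpus_per_instance(nr_cpus, nr_instances):
    """Count how many logical CPUs prefer each zone instance."""
    counts = [0] * nr_instances
    for cpu in range(nr_cpus):
        counts[preferred_zone_instance(cpu, nr_instances)] += 1
    return counts


# One NUMA node with 112 logical CPUs and 1024 GB in one zone type:
# 4 instances, so 28 CPUs prefer each instance instead of 112 sharing one.
sharing = cpus_per_instance(112, nr_zone_instances(1024))
```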

Looking at it from another perspective, the amount of memory installed
per logical CPU is not growing.  This makes it hard to enlarge the
default per-CPU cache size.

>> > I am also missing some information why pcp caches tunning is not
>> > sufficient.
>> 
>> PCP does improve the page allocation scalability greatly!  But it
>> doesn't help much for workloads that allocating pages on one CPU and
>> free them in different CPUs.  PCP tuning can improve the page allocation
>> scalability for a workload greatly.  But it's not trivial to find the
>> best tuning parameters for various workloads and workload run time
>> statuses (workloads may have different loads and memory requirements at
>> different time).  And we may run different workloads on different
>> logical CPUs of the system.  This also makes it hard to find the best
>> PCP tuning globally.
>
> Yes this makes sense. Does that mean that the global pcp tuning is not
> keeping up and we need to be able to do more auto-tuning on local bases
> rather than global?

Similar to the above, I think that PCP greatly helps performance in the
good situations, and splitting zones can help scalability in the bad
situations.  They work at different levels.

As for PCP auto-tuning, I think that it's hard to implement it in a way
that resolves all problems (that is, guarantees that the PCP is never
drained).

And auto-tuning doesn't sound easy.  Do you have any ideas about how to
do that?

>> It would be better to find a solution to improve
>> the page allocation scalability out of box or automatically.  Do you
>> agree?
>
> Yes. 

Best Regards,
Huang, Ying