From: "Huang, Ying"
To: Mel Gorman
Cc: Arjan Van De Ven, Sudeep Holla, Andrew Morton, Vlastimil Babka,
 David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko,
 Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size
References: <20230920061856.257597-1-ying.huang@intel.com>
 <20230920061856.257597-3-ying.huang@intel.com>
 <20231011122027.pw3uw32sdxxqjsrq@techsingularity.net>
 <87h6mwf3gf.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20231012125253.fpeehd6362c5v2sj@techsingularity.net>
 <87v8bcdly7.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20231012152250.xuu5mvghwtonpvp2@techsingularity.net>
Date: Fri, 13 Oct 2023 11:06:51 +0800
In-Reply-To: <20231012152250.xuu5mvghwtonpvp2@techsingularity.net> (Mel
 Gorman's message of "Thu, 12 Oct 2023 16:22:50 +0100")
Message-ID: <87pm1jcjas.fsf@yhuang6-desk2.ccr.corp.intel.com>
Mel Gorman writes:

> On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
>> Mel Gorman writes:
>>
>> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
>> >> Mel Gorman writes:
>> >>
>> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> >> >> Per-CPU data cache size is useful information. For example, it can be
>> >> >> used to determine per-CPU cache size. So, in this patch, the data
>> >> >> cache size for each CPU is calculated via data_cache_size /
>> >> >> shared_cpu_weight.
>> >> >>
>> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
>> >> >> to allocate an extra cpumask, especially in offline callback.
>> >> >>
>> >> >> Signed-off-by: "Huang, Ying"
>> >> >
>> >> > It's not necessarily relevant to the patch, but at least the scheduler
>> >> > also stores some per-cpu topology information such as sd_llc_size -- the
>> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
>> >> > worth unifying this at some point if it's common that per-cpu
>> >> > information is too fine and per-zone or per-node information is too
>> >> > coarse. This would be particularly true when considering locking
>> >> > granularity,
>> >> >
>> >> >> Cc: Sudeep Holla
>> >> >> Cc: Andrew Morton
>> >> >> Cc: Mel Gorman
>> >> >> Cc: Vlastimil Babka
>> >> >> Cc: David Hildenbrand
>> >> >> Cc: Johannes Weiner
>> >> >> Cc: Dave Hansen
>> >> >> Cc: Michal Hocko
>> >> >> Cc: Pavel Tatashin
>> >> >> Cc: Matthew Wilcox
>> >> >> Cc: Christoph Lameter
>> >> >> ---
>> >> >>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
>> >> >>  include/linux/cacheinfo.h |  1 +
>> >> >>  2 files changed, 42 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> >> >> index cbae8be1fe52..3e8951a3fbab 100644
>> >> >> --- a/drivers/base/cacheinfo.c
>> >> >> +++ b/drivers/base/cacheinfo.c
>> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> >> >>  	return rc;
>> >> >>  }
>> >> >>
>> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
>> >> >> +{
>> >> >> +	struct cpu_cacheinfo *ci;
>> >> >> +	struct cacheinfo *leaf;
>> >> >> +	unsigned int i, nr_shared;
>> >> >> +	unsigned int size_data = 0;
>> >> >> +
>> >> >> +	if (!per_cpu_cacheinfo(cpu))
>> >> >> +		return;
>> >> >> +
>> >> >> +	ci = ci_cacheinfo(cpu);
>> >> >> +	for (i = 0; i < cache_leaves(cpu); i++) {
>> >> >> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
>> >> >> +		if (leaf->type != CACHE_TYPE_DATA &&
>> >> >> +		    leaf->type != CACHE_TYPE_UNIFIED)
>> >> >> +			continue;
>> >> >> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> >> >> +		if (!nr_shared)
>> >> >> +			continue;
>> >> >> +		size_data += leaf->size / nr_shared;
>> >> >> +	}
>> >> >> +	ci->size_data = size_data;
>> >> >> +}
>> >> >
>> >> > This needs comments.
>> >> >
>> >> > It would be nice to add a comment on top describing the limitation of
>> >> > CACHE_TYPE_UNIFIED here in the context of
>> >> > update_data_cache_size_cpu().
>> >>
>> >> Sure. Will do that.
>> >>
>> >
>> > Thanks.
>> >
>> >> > The L2 cache could be unified but much smaller than a L3 or other
>> >> > last-level-cache. It's not clear from the code what level of cache is being
>> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
>> >> > is not the size of a cache, it appears to be the share of a cache a CPU
>> >> > would have under ideal circumstances.
>> >>
>> >> Yes. And it isn't for one specific level of cache. It's the sum of per-CPU
>> >> shares of all levels of cache. But the calculation is inaccurate. More
>> >> details are in the below reply.
>> >>
>> >> > However, as it appears to also be
>> >> > iterating hierarchy then this may not be accurate. Caches may or may not
>> >> > allow data to be duplicated between levels so the value may be inaccurate.
>> >>
>> >> Thank you very much for pointing this out! The cache can be inclusive
>> >> or not. So, we cannot calculate the per-CPU slice of all-level caches
>> >> via adding them together blindly. I will change this in a follow-on
>> >> patch.
>> >>
>> >
>> > Please do, I would strongly suggest basing this on LLC only because it's
>> > the only value you can be sure of. This change is the only change that may
>> > warrant a respin of the series as the history will be somewhat confusing
>> > otherwise.
>>
>> I am still checking whether it's possible to get cache inclusive
>> information via cpuid.
>>
>
> cpuid may be x86-specific so that potentially leads to different behaviours
> on different architectures.
>
>> If there's no reliable way to do that, we can use the max value of the
>> per-CPU share of each level of cache. For an inclusive cache, that will be
>> the value of LLC. For a non-inclusive cache, the value will be more
>> accurate. For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
>> per core, while LLC is 1.875 MB per core according to [1].
>>
>
> Be that as it may, it still opens the possibility of significantly different
> behaviour depending on the CPU family. I would strongly recommend that you
> start with LLC only because LLC is also the topology level of interest used
> by the scheduler and it's information that is generally available. Trying
> to get accurate information on every level and the complexity of dealing
> with inclusive vs exclusive cache or write-back vs write-through should
> be a separate patch, with separate justification and notes on how it can
> lead to behaviour specific to the CPU family or architecture.

IMHO, we should try to optimize for as many CPUs as possible. The size of
the per-CPU (HW thread for SMT) slice of LLC on the latest Intel server
CPUs is as follows,

Icelake: 0.75 MB
Sapphire Rapids: 0.9375 MB

while pcp->batch is 63 * 4 / 1024 = 0.2461 MB.

In [03/10], only if "per_cpu_cache_slice > 4 * pcp->batch" will we cache
pcp->batch pages before draining the PCP. This makes the optimization
unavailable for a significant portion of the server CPUs.

In theory, if "per_cpu_cache_slice > 2 * pcp->batch", we can reuse
cache-hot pages between CPUs.
So, if we change the condition to "per_cpu_cache_slice > 3 * pcp->batch",
I think that we are still safe.

As for other CPUs, according to [2], AMD CPUs have a larger per-CPU LLC, so
it's OK for them. ARM CPUs have a much smaller per-CPU LLC, so some further
optimization is needed there.

[2] https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/2

So, I suggest using "per_cpu_cache_slice > 3 * pcp->batch" in [03/10], and
using LLC only in this patch [02/10]. Then we can refine the per-CPU cache
slice calculation in the follow-up patches.

--
Best Regards,
Huang, Ying