From: "Huang, Ying" <ying.huang@intel.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
Arjan Van De Ven <arjan@linux.intel.com>,
Sudeep Holla <sudeep.holla@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
"David Hildenbrand" <david@redhat.com>,
Johannes Weiner <jweiner@redhat.com>,
"Dave Hansen" <dave.hansen@linux.intel.com>,
Michal Hocko <mhocko@suse.com>,
"Pavel Tatashin" <pasha.tatashin@soleen.com>,
Matthew Wilcox <willy@infradead.org>,
Christoph Lameter <cl@linux.com>
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size
Date: Fri, 13 Oct 2023 11:06:51 +0800
Message-ID: <87pm1jcjas.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20231012152250.xuu5mvghwtonpvp2@techsingularity.net> (Mel Gorman's message of "Thu, 12 Oct 2023 16:22:50 +0100")
Mel Gorman <mgorman@techsingularity.net> writes:
> On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>>
>> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
>> >> Mel Gorman <mgorman@techsingularity.net> writes:
>> >>
>> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> >> >> Per-CPU data cache size is useful information. For example, it can be
>> >> >> used to estimate the per-CPU slice of shared data caches. So, in this
>> >> >> patch, the data cache size for each CPU is calculated via
>> >> >> data_cache_size / shared_cpu_weight.
>> >> >>
>> >> >> A brute-force algorithm that iterates over all online CPUs is used to
>> >> >> avoid allocating an extra cpumask, especially in the offline callback.
>> >> >>
>> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >> >
>> >> > It's not necessarily relevant to the patch, but at least the scheduler
>> >> > also stores some per-cpu topology information such as sd_llc_size -- the
>> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
>> >> > worth unifying this at some point if it's common that per-cpu
>> >> > information is too fine and per-zone or per-node information is too
>> >> > coarse. This would be particularly true when considering locking
>> >> > granularity.
>> >> >
>> >> >> Cc: Sudeep Holla <sudeep.holla@arm.com>
>> >> >> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
>> >> >> Cc: Vlastimil Babka <vbabka@suse.cz>
>> >> >> Cc: David Hildenbrand <david@redhat.com>
>> >> >> Cc: Johannes Weiner <jweiner@redhat.com>
>> >> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> >> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> >> >> Cc: Matthew Wilcox <willy@infradead.org>
>> >> >> Cc: Christoph Lameter <cl@linux.com>
>> >> >> ---
>> >> >> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
>> >> >> include/linux/cacheinfo.h | 1 +
>> >> >> 2 files changed, 42 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> >> >> index cbae8be1fe52..3e8951a3fbab 100644
>> >> >> --- a/drivers/base/cacheinfo.c
>> >> >> +++ b/drivers/base/cacheinfo.c
>> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> >> >> return rc;
>> >> >> }
>> >> >>
>> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
>> >> >> +{
>> >> >> + struct cpu_cacheinfo *ci;
>> >> >> + struct cacheinfo *leaf;
>> >> >> + unsigned int i, nr_shared;
>> >> >> + unsigned int size_data = 0;
>> >> >> +
>> >> >> + if (!per_cpu_cacheinfo(cpu))
>> >> >> + return;
>> >> >> +
>> >> >> + ci = ci_cacheinfo(cpu);
>> >> >> + for (i = 0; i < cache_leaves(cpu); i++) {
>> >> >> + leaf = per_cpu_cacheinfo_idx(cpu, i);
>> >> >> + if (leaf->type != CACHE_TYPE_DATA &&
>> >> >> + leaf->type != CACHE_TYPE_UNIFIED)
>> >> >> + continue;
>> >> >> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> >> >> + if (!nr_shared)
>> >> >> + continue;
>> >> >> + size_data += leaf->size / nr_shared;
>> >> >> + }
>> >> >> + ci->size_data = size_data;
>> >> >> +}
>> >> >
>> >> > This needs comments.
>> >> >
>> >> > It would be nice to add a comment on top describing the limitation of
>> >> > CACHE_TYPE_UNIFIED here in the context of
>> >> > update_data_cache_size_cpu().
>> >>
>> >> Sure. Will do that.
>> >>
>> >
>> > Thanks.
>> >
>> >> > The L2 cache could be unified but much smaller than a L3 or other
>> >> > last-level-cache. It's not clear from the code what level of cache is being
>> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
>> >> > is not the size of a cache, it appears to be the share of a cache a CPU
>> >> > would have under ideal circumstances.
>> >>
>> >> Yes. And it isn't for one specific level of cache. It's the sum of the
>> >> per-CPU shares of all cache levels. But the calculation is inaccurate.
>> >> More details are in the reply below.
>> >>
>> >> > However, as it appears to also be
>> >> > iterating hierarchy then this may not be accurate. Caches may or may not
>> >> > allow data to be duplicated between levels so the value may be inaccurate.
>> >>
>> >> Thank you very much for pointing this out! A cache can be inclusive
>> >> or not, so we cannot calculate the per-CPU slice of all cache levels
>> >> by blindly adding them together. I will change this in a follow-on
>> >> patch.
>> >>
>> >
>> > Please do, I would strongly suggest basing this on LLC only because it's
>> > the only value you can be sure of. This change is the only change that may
>> > warrant a respin of the series as the history will be somewhat confusing
>> > otherwise.
>>
>> I am still checking whether it's possible to get cache inclusive
>> information via cpuid.
>>
>
> cpuid may be x86-specific, which potentially leads to different behaviours
> on different architectures.
>
>> If there's no reliable way to do that, we can use the maximum of the
>> per-CPU shares of each cache level. For an inclusive cache, that will be
>> the value for the LLC. For a non-inclusive cache, the value will be more
>> accurate. For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
>> per core, while the LLC is 1.875 MB per core according to [1].
>>
>
> Be that as it may, it still opens the possibility of significantly different
> behaviour depending on the CPU family. I would strongly recommend that you
> start with LLC only because LLC is also the topology level of interest used
> by the scheduler and it's information that is generally available. Trying
> to get accurate information on every level and the complexity of dealing
> with inclusive vs exclusive cache or write-back vs write-through should
> be a separate patch, with separate justification and notes on how it can
> lead to behaviour specific to the CPU family or architecture.
IMHO, we should try to optimize for as many CPUs as possible. The size
of the per-CPU (HW thread for SMT) slice of the LLC on the latest Intel
server CPUs is as follows,

Icelake: 0.75 MB
Sapphire Rapids: 0.9375 MB

Meanwhile, pcp->batch is 63 pages * 4 KB = 252 KB (0.2461 MB).

In [03/10], we cache pcp->batch pages before draining the PCP only if
"per_cpu_cache_slice > 4 * pcp->batch". This makes the optimization
unavailable for a significant portion of server CPUs.
In theory, if "per_cpu_cache_slice > 2 * pcp->batch", we can reuse
cache-hot pages between CPUs. So, if we change the condition to
"per_cpu_cache_slice > 3 * pcp->batch", I think we are still safe.
As for other CPUs, according to [2], AMD CPUs have a larger per-CPU LLC,
so they are fine. ARM CPUs have a much smaller per-CPU LLC, so some
further optimization is needed.
[2] https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/2
So, I suggest using "per_cpu_cache_slice > 3 * pcp->batch" in [03/10]
and using the LLC in this patch [02/10]. Then, we can optimize the
per-CPU cache slice calculation in follow-up patches.
--
Best Regards,
Huang, Ying