From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id EB1A2C5478C
	for <linux-mm@archiver.kernel.org>; Tue, 27 Feb 2024 00:40:20 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 4F5904401D1; Mon, 26 Feb 2024 19:40:20 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 4A4E944017F; Mon, 26 Feb 2024 19:40:20 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 31DE74401D1; Mon, 26 Feb 2024 19:40:20 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 1D7C044017F
	for <linux-mm@kvack.org>; Mon, 26 Feb 2024 19:40:20 -0500 (EST)
Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id E3315A0941
	for <linux-mm@kvack.org>; Tue, 27 Feb 2024 00:40:19 +0000 (UTC)
X-FDA: 81835727358.13.BF4FE7B
Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12])
	by imf12.hostedemail.com (Postfix) with ESMTP id 8B3B840008
	for <linux-mm@kvack.org>; Tue, 27 Feb 2024 00:40:17 +0000 (UTC)
Authentication-Results: imf12.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=HoTQ8aSO;
	spf=pass (imf12.hostedemail.com: domain of ying.huang@intel.com designates 192.198.163.12 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1708994418;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=+2VQVBa78OSGLrcXrapIclEydbrNRhP9RMYILCJEpZ8=;
	b=u9hXXFPF6MSlRG1PGNWyiOXyc0Y1aN6YF56/xzwu0CYTt5Z2icLWoQawSk3rVbxA9wYvVN
	l756iSLbyG5q6yDohRV2KWdgRn/bTaT553KwCLZTf+LFl2Rk4fxB/T7eF8XbyshccKQuvn
	ag1o6U9RobgSJDPqMLlltbHQKRb2RHE=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1708994418; a=rsa-sha256;
	cv=none;
	b=R2ty0Nkzwl0aBKmbreuR/7RKitsYsoQZ451z6GKOsPBvAGakGsqoJ7UhYem2vi6uoF6NOV
	dwm7qbn5j5Z0qcsKya/yrAPUIO0T9gk5jajv1LFRf+Y5FIpUKszA/zTvKVmU2MvWBtxvKP
	BN+5gORkVmb/hjeTZz8fTaJIzJSSz/E=
ARC-Authentication-Results: i=1;
	imf12.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=HoTQ8aSO;
	spf=pass (imf12.hostedemail.com: domain of ying.huang@intel.com designates 192.198.163.12 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1708994418; x=1740530418;
  h=from:to:cc:subject:in-reply-to:references:date:
   message-id:mime-version;
  bh=XQrJcpKgxIV+gtw9yVUIUl0OVoNPwIVzq5HXcRyo620=;
  b=HoTQ8aSOARPa/Sdm9lqzdVyyLK7tTohJqrURJpYuXxYE3SQIjSz/og1N
   G+W6WOqtuM6iq34zYJCk3pGK5q9FfMp2pnOl1mM8X7wsr4G4qpgJBAes4
   kApno161npXp2/Kg4ox+GVPTcRDkFIDX5b4j1jc4Qf4a8gnwTi6bBSELN
   wJ7HdDVtSMJWpDMtdX5jsER+A5I2SNku47WOFaj3+ypPJS2xGE+z5FwZu
   2OoeZGojHJZY5gPPDcoY2pc/lSsB2YCaDfr65YFXjwLRrEwIzfxxOug7z
   Z0dhqGu+ldxUCjlUMbYZW4LW0PsS/6Ff3OE0lmU/mYU2DEUjUjmwv6i5x
   A==;
X-IronPort-AV: E=McAfee;i="6600,9927,10996"; a="7101519"
X-IronPort-AV: E=Sophos;i="6.06,187,1705392000"; 
   d="scan'208";a="7101519"
Received: from fmviesa007.fm.intel.com ([10.60.135.147])
  by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Feb 2024 16:40:16 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.06,187,1705392000"; 
   d="scan'208";a="6729837"
Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55])
  by fmviesa007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Feb 2024 16:40:14 -0800
From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,  <linux-mm@kvack.org>,
  <linux-kernel@vger.kernel.org>,  <hannes@cmpxchg.org>,
  <dan.j.williams@intel.com>,  <dave.jiang@intel.com>
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave
 weights
In-Reply-To: <ZdygZ8ZidfaORg8F@memverge.com> (Gregory Price's message of "Mon,
	26 Feb 2024 09:29:59 -0500")
References: <20240220202529.2365-1-gregory.price@memverge.com>
	<20240220202529.2365-2-gregory.price@memverge.com>
	<87wmqxht4c.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZdgxaLSBznupVmJK@memverge.com>
	<87sf1jh7es.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZdygZ8ZidfaORg8F@memverge.com>
Date: Tue, 27 Feb 2024 08:38:19 +0800
Message-ID: <87edcyeo78.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
X-Rspamd-Queue-Id: 8B3B840008
X-Rspam-User: 
X-Rspamd-Server: rspam11
X-Stat-Signature: 6g9bex85o3ww3ejkcb1c6cphmr5j36ij
X-HE-Tag: 1708994417-826453
X-HE-Meta: U2FsdGVkX1/DW3R/RPTgmJR9U1OlyK3rC22kWdWMSXWa0IXCRCcytBvF9KO+/hESX2Dnxbud9olwVnWz61Jks1g3fgXAysYXYE8x3/T7t3uAiEeKBvtmqdlT1GHaE3gHny8wqgUBQviECWSociiUmUfQ1iEgC10akqWOcuYfWdIwKUEmmJ/TJYiF5ovwACJtKZaGJKdTg11S1IXXE4TsYuP0lo1AXCaLwmysZljCv4nt8eTefW+RejAoNk1GQWRRTIzolcxkcHR9u6FvB7K4sfm7qLEOIdrlc4czLQt8sBBRyUVC/CF2stTXAaWD90wf1Gihu+WnYP5i387hZYC5yHVDkGXXw5dlkmmL9Kx1K/aim6aH8VB2Aj/VWVM/iptUI5YLKCPTLnkGjwOGOpQcB37N9PMPUsuQuY3jVIAusCJjEIbAWOjl3sm3Tjf8R2qFJnIT6VzhsTuj+SFPM8dEs1rBVDzLcT8lLZibAg8Td+Pi9SifwmgYPDp5ORnJrcCV0hnb9efHtQLVhZHJ4+drTI1A192WIxXI3sFDUfmEHLJOH3T+IokpNQeQ6WcNpdHS8m69CwoDoXpyc8OXKIrocMa/lYK9VFj96fRqKKcsDHg2qMFEbrYkuL9/ojh3gdWdSs9aHVX3fZdZlOMMYtSPoxfSNMpzjzVYxO+D65M7yHny+JTAAt9hPuJQGsy5k2lnl0QSGtHogrV0aM2VS5V+KswfQsySS98qCqtpLoBYwu2IRq0M5b+EwnOwgS/Hc6K9bchuhL7EiCljJvmxrMwsPZKPRg+niwUdzaBP0Vq3bezwK+wea6hT/IBZ9v3/Favf3kJKH1+aPhXJXZRVPhL8ORkhhnf/WJohK+kAjoRCGJQs+odPeO6SUqMY+EN0dw/S12hHWrWxr9cRdlsOfyerL/h426ua4gODEjO40a3dwE0=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000005, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Feb 23, 2024 at 05:11:23PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>>
>
> (sorry for the re-send, error replying to list)
>
>> >> > +	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
>> >> > +	for (i = 0; i < nr_node_ids; i++) {
>> >> > +		if (new_bw[i])
>> >> > +			new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
>> 
>> IIUC, the sum of interleave weights of all nodes will be 100.  If there
>> are more than 100 nodes in the system, this doesn't work properly.  How
>> about using some fixed number like "16" for DRAM nodes?
>>
>
> I suppose we could add a "type" value into the interface that says
> what approximate "tier" a node is in, or we could ask the tiering
> component for that information.  But what does this actually change?
>
> You still calculate the percentage of bandwidth provided by each node,
> and then just apply that to the larger default number. I don't see the
> point in that - if each node provides less than 1% of the overall system
> bandwidth, then larger numbers won't do much. In fact, we want smaller
> numbers to spread spatially local data out more aggressively.
>
> More important question: In what world is a large numa system liable
> to use this interface to any real benefit?
>
>
> I'd briefly considered this, but I strayed away from supporting that
> case.  Probably worth documenting, at the very least.
>
> We had the cross-socket interleave discussion previously in the prior
> series.  The question above simplifies (complicates?) to:  How useful
> is interleave (weighted or not) in cross-socket workloads?
>
> Consider the following configuration:
>
>
>  ---------   A  --------    C    -------- D  ---------
>  | DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
>  ---------      --------         --------    ---------
> 	           | B              | E
>                 --------         --------
>                 | cxl0 |         | cxl1 |
>                 --------         --------
>
> Theoretical throughputs
>
> A&D: 512GB/s  (8 channel DDR5)
> B&E: 64GB/s   (1 CXL/PCIe5 link)
> C  : 62.4GB/s (3x UPI links)
>
> Where are the 100 nodes coming from?

If you have a really large machine with more than 100 nodes, and some of
them are CXL memory nodes, then it's possible that most nodes will have
interleave weight "1" because the sum of all interleave weights is
"100".  Then, even if you use only one socket, the interleave weights of
DRAM and CXL MEM could both be "1", leading to a useless default value.
So, I suggest not capping the sum of interleave weights.
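
To make the concern concrete, here is a minimal userspace sketch of the
percentage-based calculation from the quoted hunk.  It is not the patch
code: pct_weights() is a hypothetical helper name, and giving weight 1 to
nodes without bandwidth data is an assumption taken from the comment in
the hunk.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of weight = max(100 * bw / total_bw, 1).
 * Hypothetical helper, not a kernel function.  Nodes that report no
 * bandwidth (bw == 0) get the minimum weight of 1 (assumption based on
 * the comment in the quoted hunk).
 */
static void pct_weights(const unsigned long long *bw, unsigned char *iw,
			size_t n)
{
	unsigned long long total = 0;
	size_t i;

	assert(n > 0);

	for (i = 0; i < n; i++)
		total += bw[i];

	for (i = 0; i < n; i++) {
		unsigned long long w;

		if (!bw[i]) {
			iw[i] = 1;
			continue;
		}
		/* Integer percentage of total bandwidth, clamped to >= 1. */
		w = 100 * bw[i] / total;
		iw[i] = w ? (unsigned char)w : 1;
	}
}
```

With the HMAT numbers from your example (176100 for DRAM, 60000 for CXL)
on a 4-node system, this gives 37 and 12.  With a made-up 128-node system
(64 DRAM nodes and 64 CXL nodes with the same numbers), every node ends up
with weight 1, so the DRAM:CXL ratio is lost.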

> If it's across interconnects (UPI), then the throughput to remote
> DRAM is better described by C, not A or D. However, we don't have
> that information (maybe we should?).  More importantly... is
> interleaving across these links even useful?  I suppose if you did
> sub-numa clustering stuff and had an ultra-super-numa-aware piece
> of software capable of keeping certain chunks of memory in certain
> cores that might be useful.... but then you probably actually want
> task-local weights as opposed to using the system default.
>
> Otherwise, does a UPI link actually get the full throughput? Probably
> only if the remote memory bus is unloaded.  If the remote bus is
> loaded, then link C performance information is basically a lie.
>
> I've been convinced (so far) that cross-socket interconnect
> interleaving is not a real use-case unless you intend to only run
> your software on a single socket and use the remote socket for
> whatever you can swipe over the interconnect. In that case, you're
> probably smart enough to set the interleave weights manually.
>
>
> So what if the nodes are coming from many memory sources down one
> or more local CXL links (link B from cpu0).
>
>  ---------   A  --------
>  | DRAM0 | ---- | cpu0 |
>  ---------      --------
> 	           | B 
>       ----------------------------
>       |                          |
>   --------                    --------
>   | cxl0 |       ......       | cxlN |
>   --------                    --------
>
> In that case it would be better for many reasons to reconfigure the
> system to combine those nodes into fewer nodes via a hardware interleave
> set.  This can be done in hardware (at a switch), in BIOS (at the root
> complex), or by the CXL Driver.  The result is fewer nodes, and the real
> performance of that node can be calculated by the drivers and reported
> accordingly.
>
>
>
> So coming back to this code:  Then why am I doing GCD across all
> nodes, rather than taking the full topology into account?  Mostly
> because the topological information is not easily available, would
> be complex to communicate across components, and the full reduction
> is a decent approximation anyway.
>
> Example from above using real HMAT reported numbers
>
> A&D: 176100
> B&E: 60000
> C:   Not a node, no information available.
>
> Produces Node Weights
>
> Calculating total system weighted average
> A:37  D:37  B:12  E:12  (37 is prime so no reductions possible)
>
> Calculating local-node relationships only
> A:74--B:25  D:74--E:25  (GCD is 1, so no reductions possible)
>
> Notice that 12+37 = 49, and 12/49 = 24%
>
> So the ratios end up working out basically the same, and
> the smaller numbers produced by averaging over the entire system
> are preferable to the "topologically aware" numbers anyway.
>
>
> Obviously this breaks in a "large numa system" - but again...
> is this even useful for those systems anyway? I contend: No.
>
>
> This is still reasonably accurate in non-homogeneous systems
>
>  ---------   A  --------    C    -------- D  ---------
>  | DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
>  ---------      --------         --------    ---------
> 	           | B
>                 --------
>                 | cxl0 |
>                 --------
>
> In this system the numbers work out to:
>
> Global:  A:42  B:14  D: 42  (GCD: 14)
> Reduce:  A:3   B:1   D: 3
>
> A user doing `-w --interleave=A,B` will get a ratio of 3:1, which
> is pretty much spot on.
>
>
> So, long-winded way of saying:
> - Could we use a larger default number? Yes.
> - Does that actually help us? Not really, we want smaller numbers.

The larger number will be reduced after GCD.
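
As a rough illustration of that reduction step, here is a userspace sketch.
reduce_weights() and gcd_u() are hypothetical names for this example only,
and the sketch assumes the reduction simply divides every weight by the GCD
of all weights.

```c
#include <assert.h>
#include <stddef.h>

/* Euclid's algorithm; gcd_u(0, x) returns x. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Divide every weight by the GCD of all weights.
 * Hypothetical helper, not a kernel function.
 */
static void reduce_weights(unsigned int *iw, size_t n)
{
	unsigned int g = 0;
	size_t i;

	assert(n > 0);

	for (i = 0; i < n; i++)
		g = gcd_u(g, iw[i]);

	/* Nothing to do if the weights share no common factor (or are all 0). */
	if (g <= 1)
		return;

	for (i = 0; i < n; i++)
		iw[i] /= g;
}
```

For your non-homogeneous example, {42, 14, 42} becomes {3, 1, 3}.  Weights
without a common factor, such as {74, 25}, are left unchanged, so the
reduction only helps when the larger numbers happen to share a factor.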

> - Does this reduce to normal-interleave under large-numa systems? Yes.
> - Does that matter? Probably not. It doesn't seem like a real use case.
> - What if it is?  The workloads probably want task-local weights anyway.
>
>> >
>> > In this scenario, I'm not sure what to do.  We must have a non-0 value
>> > for that device (to avoid div-by-0), but setting an arbitrarily large
>> > value also seems bad.
>> 
>> I think that it's kind of reasonable to use DRAM bandwidth for devices
>> without data.  If there are only DRAM nodes and nodes without data, this
>> will reduce the interleave weights to "1".
>>
>
> Yes, those nodes would reduce to 1.  Which is pretty much the best we can
> do without accounting for interconnects - which as discussed above is not
> really useful anyway.
>
>
>
> I think I'll draft up an LSF/MM chat to see if we can garner more input.
> If large-numa systems are a real issue, then yes we need to address it.

Sounds good to me!

--
Best Regards,
Huang, Ying

> ~Gregory