From: "Huang, Ying" <ying.huang@intel.com>
To: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
In-Reply-To: <20231101092923.283-1-ravis.opensrc@micron.com> (Ravi Jonnalagadda's message of "Wed, 1 Nov 2023 14:59:23 +0530")
References: <87il6m6w2j.fsf@yhuang6-desk2.ccr.corp.intel.com> <20231101092923.283-1-ravis.opensrc@micron.com>
Date: Thu, 02 Nov 2023 14:41:03 +0800
Message-ID: <87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com>
Ravi Jonnalagadda writes:

>>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>>>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>>>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>>>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>>
>> [snip]
>>
>>>> > This hopefully also explains why it's a global setting. The usecase is
>>>> > different from conventional NUMA interleaving, which is used as a
>>>> > locality measure: spread shared data evenly between compute
>>>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>>>> > compute. Instead, the optimal spread is based on hardware parameters,
>>>> > which is a global property rather than a per-workload one.
>>>>
>>>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>>>> for this specific CXL usecase but it just doesn't fit into many others I
>>>> can think of - e.g. proportional use of those tiers based on the
>>>> workload - you get what you pay for.
>>>>
>>>> Is there any specific reason for not having a new interleave interface
>>>> which defines weights for the nodemask? Is this because the policy
>>>> itself is very dynamic or is this more driven by simplicity of use?
>>>
>>> A downside of *requiring* weights to be paired with the mempolicy is
>>> that it's then the application that would have to figure out the
>>> weights dynamically, instead of having a static host configuration.
>>> A policy of "I want to be spread for optimal bus bandwidth" translates
>>> between different hardware configurations, but optimal weights will
>>> vary depending on the type of machine a job runs on.
>>>
>>> That doesn't mean there couldn't be usecases for having weights as
>>> policy as well in other scenarios, like you allude to above. It's just
>>> that so far such usecases haven't really materialized or been spelled
>>> out concretely. Maybe we just want both - a global default, and the
>>> ability to override it locally.
>>
>> I think that this is a good idea. A system-wide configuration with
>> reasonable defaults makes applications' lives much easier. If more
>> control is needed, some kind of workload-specific configuration can be
>> added.
>
> Glad that we are in agreement here. For the bandwidth expansion use
> cases that this interleave patchset is trying to cater to, most
> applications would follow the "reasonable defaults" for weights.
> The main reason for applications to choose different weights while
> interleaving would probably be capacity expansion, which the default
> memory tiering implementation would support anyway, and with better
> latency.
>
>> And, instead of adding another memory policy, a cgroup-wise
>> configuration may be easier to use. The per-workload weights may need
>> to be adjusted when deploying different combinations of workloads on
>> the system.
>>
>> Another question is whether the weight should be per-memory-tier or
>> per-node. In this patchset, the weight is per source-target node
>> combination. That is, the weight becomes a matrix instead of a vector.
>> IIUC, this is used to control cross-socket memory access in addition to
>> per-memory-type memory access. Do you think the added complexity is
>> necessary?
>
> Pros and cons of node-based interleave:
> Pros:
> 1. Weights can be defined individually for devices with different
>    bandwidth and latency characteristics, irrespective of which tier
>    they fall into.
> 2. Defining the weight per source-target node pair is necessary for
>    multi-socket systems, where some devices may be closer to one socket
>    than to another.
> Cons:
> 1. Weights need to be programmed for all the nodes, which can be
>    tedious for systems with a lot of NUMA nodes.

2. More complex, so it needs justification, for example, a practical use
case.

> Pros and cons of memory-tier-based interleave:
> Pros:
> 1. Programming the weight per initiator applies to all the nodes in the
>    tier.
> 2. Weights can be calculated from the cumulative bandwidth of all the
>    nodes in the tier and need to be programmed only once for all the
>    nodes in a given tier.
> 3. It may be useful as the number of NUMA nodes with similar latency
>    and bandwidth characteristics grows, possibly with pooling use
>    cases.

4. Simpler.

> Cons:
> 1. If nodes with different bandwidth and latency characteristics are
>    placed in the same tier, as seen in the current mainline kernel, it
>    will be difficult to apply a correct interleave weight policy.
> 2. There will be a need for functionality to move nodes between tiers,
>    or to create new tiers to place such nodes in, in order to program
>    correct interleave weights. We are currently working on a patch to
>    support this.

Thanks! If we have such a system, we will need this.

> 3. For systems where each NUMA node has different characteristics, each
>    node might end up in a different memory tier, which would be
>    equivalent to node-based interleaving.

No. A node can only exist in one memory tier.

> On newer systems, where all CXL memory from different devices under a
> port is combined to form a single NUMA node, this scenario might be
> applicable.

You mean that different memory ranges of a NUMA node may have different
performance? I don't think that we can deal with this.

> 4. Users may need to keep track of the different memory tiers and which
>    nodes are present in each tier when invoking the interleave policy.

I don't think this is a con.
With a node-based solution, you need to know your system too.

>>
>>> Could you elaborate on the 'get what you pay for' usecase you
>>> mentioned?
>>

--
Best Regards,
Huang, Ying