From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4A5C3C4332F
	for <linux-mm@archiver.kernel.org>; Mon, 30 Oct 2023 05:25:24 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 8B9636B0182; Mon, 30 Oct 2023 01:25:23 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 840E96B0183; Mon, 30 Oct 2023 01:25:23 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 6E0F86B0184; Mon, 30 Oct 2023 01:25:23 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 5C3B66B0182
	for <linux-mm@kvack.org>; Mon, 30 Oct 2023 01:25:23 -0400 (EDT)
Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay05.hostedemail.com (Postfix) with ESMTP id 320FD40484
	for <linux-mm@kvack.org>; Mon, 30 Oct 2023 05:25:23 +0000 (UTC)
X-FDA: 81400989726.02.B8022EC
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
	by imf15.hostedemail.com (Postfix) with ESMTP id 3908BA000D
	for <linux-mm@kvack.org>; Mon, 30 Oct 2023 05:25:20 +0000 (UTC)
Authentication-Results: imf15.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=RtUYwbiN;
	spf=pass (imf15.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1698643520;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=bOqs6MF9XgW5LLEleyNcHpmuc6wG/VnNwDuUlTUjA78=;
	b=yi+M6eJc8Ip5vvM2v/A8vHixsYFeJHBPixEZrDQq8VkzLwIKUb6GfApZFd9g9VI1GZa+A9
	GoVv9Yob0UT8V3aS3/6BO+X/z5MzZemwF3eYuWtmNVcVFqk2+QCXaAqIOTy2UGTuJD6gE5
	YzzxVX5mJPXiFdstv4+x2xL+ZuCYjY4=
ARC-Authentication-Results: i=1;
	imf15.hostedemail.com;
	dkim=pass header.d=intel.com header.s=Intel header.b=RtUYwbiN;
	spf=pass (imf15.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
	dmarc=pass (policy=none) header.from=intel.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1698643520; a=rsa-sha256;
	cv=none;
	b=llrjUrqwiBiQQ2msW06Oa7FL/YnD6pmn5zPEORQ3RFNCNVJ6vNRXmnG6mxWyknrEyRlea4
	myYAVjWh/fD8fnK9Kz0uzWNo/mjeTWU5sSOq3fJVxerak6kwt9eqx5FE9tMUnVYGnJ4nM4
	JZda2058xytjooaKG28KL7tuo627oGQ=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1698643520; x=1730179520;
  h=from:to:cc:subject:in-reply-to:references:date:
   message-id:mime-version;
  bh=q8KoSMB/hpLdN583LzLTy2WR4ufc7oF95Z3VOZQo2IE=;
  b=RtUYwbiNrpVjDYGjJSdVAB3kDFur1z47o84D1Nk4xRFGy+n48Hq75tDG
   Ht6PeSshuY89o7Ovc0x+18qTENLub3+m1ZhDwpF2UmPS/tzm9tneN72TS
   5jKRnkPcQV6r10cq/wTvTJUo35EJruiNqrRTEtFNXaw6JDpC7ZsLaM1P8
   t1ZNl1AsdOS5ARiqDAkHheK9m3DiC7qtMziaS7ZGmt45yT7HEfd0qvHty
   zD64dAvITOL143UphanHOFvtx1Xrx6qh1L1tUCCeuI0TTHL4Hsm6gn4Vh
   ptjM49REa0tOJRl7+Hgmbk2NzYYHT2JxTOzbzt3z8PFEk+L58/TunvCXv
   Q==;
X-IronPort-AV: E=McAfee;i="6600,9927,10878"; a="373065031"
X-IronPort-AV: E=Sophos;i="6.03,262,1694761200"; 
   d="scan'208";a="373065031"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Oct 2023 22:25:18 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10878"; a="736638803"
X-IronPort-AV: E=Sophos;i="6.03,262,1694761200"; 
   d="scan'208";a="736638803"
Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Oct 2023 22:25:14 -0700
From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,  <linux-mm@kvack.org>,
  <linux-kernel@vger.kernel.org>,  <linux-cxl@vger.kernel.org>,
  <akpm@linux-foundation.org>,  <sthanneeru@micron.com>,  Aneesh Kumar K.V
 <aneesh.kumar@linux.ibm.com>,  Wei Xu <weixugc@google.com>,  Alistair
 Popple <apopple@nvidia.com>,  Dan Williams <dan.j.williams@intel.com>,
  Dave Hansen <dave.hansen@intel.com>,  Johannes Weiner
 <hannes@cmpxchg.org>,  "Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
  Michal Hocko <mhocko@kernel.org>,  "Tim Chen" <tim.c.chen@intel.com>,
  Yang Shi <shy828301@gmail.com>
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
In-Reply-To: <ZT8u2246+vkA/4F+@memverge.com> (Gregory Price's message of "Mon,
	30 Oct 2023 00:19:39 -0400")
References: <ZS33ClT00KsHKsXQ@memverge.com>
	<87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZS9HSIrblel39qrt@memverge.com>
	<87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZTEud5K5T+dRQMiM@memverge.com>
	<87ttqidr7v.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZTfjqoEZXQWs/rxV@memverge.com>
	<87lebrec82.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZTlxxR0ntEzBPwre@memverge.com>
	<87a5s0df6p.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<ZT8u2246+vkA/4F+@memverge.com>
Date: Mon, 30 Oct 2023 13:23:11 +0800
Message-ID: <87sf5sbs5c.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
X-Rspamd-Queue-Id: 3908BA000D
X-Rspam-User: 
X-Stat-Signature: 1jff1mn13jhg4hhshogt4jdb77xgcbez
X-Rspamd-Server: rspam01
X-HE-Tag: 1698643520-694171
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Oct 30, 2023 at 10:20:14AM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> Extending this adds complexity to the kernel code and changes the kernel
>> ABI.  So, IMHO, we need some real-life use case to prove that the added
>> complexity is necessary.
>> 
>> For example, in [1], Johannes showed a use case that supports adding
>> per-memory-tier interleave weights.
>> 
>> [1] https://lore.kernel.org/all/20220607171949.85796-1-hannes@cmpxchg.org/
>> 
>> --
>> Best Regards,
>> Huang, Ying
>
> Sorry, I misunderstood your question.
>
> The use case is the same as the N:M interleave strategy between tiers,
> and in fact the proposal for weights was directly inspired by the patch
> you posted. We're searching for the best way to implement weights.
>
> We've discussed placing these weights in:
>
> 1) mempolicy :
>    https://lore.kernel.org/linux-cxl/20230914235457.482710-1-gregory.price@memverge.com/
>
> 2) tiers
>    https://lore.kernel.org/linux-cxl/20231009204259.875232-1-gregory.price@memverge.com/
>
> and now
> 3) the nodes themselves
>    RFC not posted yet
>
> The use case is the exact same as the patch you posted, which is to enable
> optimal distribution of memory to maximize memory bandwidth usage.
>
> The use case is straightforward. Consider a machine with the following
> NUMA nodes:
>
> 1) Socket 0 - DRAM - ~400GB/s bandwidth local, less cross-socket
> 2) Socket 1 - DRAM - ~400GB/s bandwidth local, less cross-socket
> 3) CXL Memory Attached to Socket 0 with ~64GB/s per link.
> 4) CXL Memory Attached to Socket 1 with ~64GB/s per link.
>
> The goal is to enable mempolicy to implement weighted interleave such
> that a thread running on socket 0 can effectively spread its memory
> across each NUMA node (or some subset thereof) such that it maximizes
> its bandwidth usage across the various devices.
>
> For example, let's consider a system with only 1 & 2 (2 sockets w/ DRAM).
>
> On an Intel System with UPI, the "effective" bandwidth available for a
> task on Socket 0 is not 800GB/s, it's about 450-500GB/s split about
> 300/200 between the sockets (you never get the full amount, and UPI limits
> cross-socket bandwidth).
>
> Today `numactl --interleave` will split your memory 50:50 between
> sockets, which is just blatantly suboptimal.  In this case you would
> prefer a 3:2 distribution (literally weights of 3 and 2 respectively).
>
> The extension to CXL becomes obvious then, as each individual node,
> respective to its CPU placement, has a different optimal weight.
>
>
> Of course the question becomes "what if a task uses more threads than a
> single socket has to offer", and the answer there is essentially the
> same as the answer today:  Then that process must become "NUMA-aware" to
> make the best use of the available resources.
>
> However, for software capable of exhausting bandwidth from a single
> socket (which on Intel takes about 16-20 threads with certain access
> patterns), a weighted-interleave system provided via some interface
> like `numactl --weighted-interleave` with weights either set in NUMA
> nodes or mempolicy is sufficient.

I think that these are all possible in theory.  Thanks for the detailed
explanation!

Now the question is whether these issues are relevant in practice.
Are all workloads with extremely high memory bandwidth requirements
NUMA-aware?  Or are they multi-process instead of multi-threaded?
Is cross-socket traffic avoided as much as possible in practice?
I have no answers to these questions.  Do you?  Or can someone else
answer them?
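
To make the 3:2 example quoted above concrete, here is a minimal
user-space sketch of what a weighted interleave would do to page
placement.  It is only an illustration: it is not the kernel
implementation and is not taken from any of the RFCs.  The function name
weighted_interleave, the weights dictionary, and the 1000-page allocation
are all hypothetical, and the weights simply mirror the ~300/200 GB/s
split mentioned in the quoted text.

```python
from collections import Counter
from itertools import cycle, islice


def weighted_interleave(weights):
    """Return an endless iterator of node IDs.

    Each round places weights[node] consecutive pages on each node,
    visiting nodes in ascending ID order.  (Hypothetical helper, for
    illustration only.)
    """
    order = [node for node, w in sorted(weights.items()) for _ in range(w)]
    return cycle(order)


# Assumed weights: node 0 (local DRAM) = 3, node 1 (remote DRAM) = 2.
weights = {0: 3, 1: 2}

# Place 1000 hypothetical pages and count how many land on each node.
placement = list(islice(weighted_interleave(weights), 1000))
counts = Counter(placement)
share = {node: counts[node] / len(placement) for node in weights}
```

With these weights the pages split 60:40 between the two nodes, which
matches the 300:200 bandwidth ratio in the example, whereas plain
interleave would give 50:50.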

--
Best Regards,
Huang, Ying