From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Gregory Price, Aneesh Kumar K.V, Wei Xu, Alistair Popple, Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron, Michal Hocko, Tim Chen, Yang Shi
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
In-Reply-To: (Gregory Price's message of "Tue, 17 Oct 2023 22:47:36 -0400")
References: <20231009204259.875232-1-gregory.price@memverge.com> <87o7gzm22n.fsf@yhuang6-desk2.ccr.corp.intel.com> <87pm1cwcz5.fsf@yhuang6-desk2.ccr.corp.intel.com> <87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 20 Oct 2023 14:11:40 +0800
Message-ID: <87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Gregory Price writes:

[snip]

> Example 1: A single-socket system with multiple CXL memory devices
> ===
> CPU Node: node0
> CXL Nodes: node1, node2
>
> Bandwidth attributes (in theory):
> node0 - 8 channels - ~307GB/s
> node1 - x16 link - 64GB/s
> node2 - x8 link - 32GB/s
>
> In a system like this, the optimal distribution of memory on an
> interleave for maximizing bandwidth is about 76%/16%/8%.
>
> For the sake of simplicity: --weighted-interleave=0:76,1:16,2:8
> but realistically we could make the weights sysfs values in the node.
>
> Regardless of the mechanism used to engage this, the most effective way
> to capture this in the system is by applying weights to nodes, not
> tiers. If done in tiers, each node would have to be assigned to its own
> tier, making the mechanism equivalent. So you might as well simplify
> the whole thing and chop the memtier component out.
>
> Is this configuration realistic? *shrug* - technically possible. And in
> fact most hardware- or driver-based interleaving mechanisms would not
> really be able to manage an interleave region across these nodes, at
> least not without placing the x16 link in x8 mode, or just having the
> wrong distribution %'s.
>
>
> Example 2: A dual-socket system with 1 CXL device per socket
> ===
> CPU Nodes: node0, node1
> CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
>
> Bandwidth attributes (in theory):
> nodes 0 & 1 - 8 channels - ~307GB/s ea.
> nodes 2 & 3 - x16 link - 64GB/s ea.
>
> This is similar to example #1, but with one difference: a task running
> on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
> This is because on accesses to nodes 1 and 3, the cross-socket link
> (UPI, or whatever AMD calls it) becomes a bandwidth chokepoint.
>
> So from the perspective of node 0, the "real total" available bandwidth
> is about 307GB/s + 64GB/s + (41.6GB/s * UPI links) in the case of
> Intel, so the best result you could get is around 307+64+164 = 535GB/s
> if you have the full 4 links.
>
> You'd want to distribute the cross-socket traffic proportional to UPI
> bandwidth, not the remote nodes' total.
>
> This leaves us with weights of:
>
> node0 - 57%
> node1 - 26%
> node2 - 12%
> node3 - 5%
>
> Again, nodes are naturally the place to carry the weights here. In
> this scenario, placing them in memory-tiers would require one tier per
> node.

Does the workload run on the CPUs of node 0 only? That appears
unreasonable. If the memory bandwidth requirement of the workload is so
large that CXL is used to expand bandwidth, why not also run the
workload on the CPUs of node 1 and use the full memory bandwidth of
node 1?

If the workload runs on the CPUs of both node 0 and node 1, then the
cross-socket traffic should be minimized if possible. That is,
threads/processes on node 0 should interleave memory of node 0 and
node 2, while those on node 1 should interleave memory of node 1 and
node 3.

But TBH, I lack knowledge about real-life workloads, so my
understanding may be wrong. Please correct me if I have made any
mistakes.

--
Best Regards,
Huang, Ying

[snip]
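For concreteness, the weight percentages quoted in both examples above can be reproduced with a short sketch. The bandwidth figures are taken from the examples; the way the UPI-capped cross-socket budget is split between node1 and node3 (proportional to their native bandwidth) is my assumption, chosen because it reproduces the 26%/5% figures:

```python
def interleave_weights(bw):
    """Weight each node proportional to its bandwidth (percent, rounded)."""
    total = sum(bw.values())
    return {node: round(100 * b / total) for node, b in bw.items()}

# Example 1: single socket, node0/node1/node2 = 307/64/32 GB/s.
ex1 = interleave_weights({"node0": 307, "node1": 64, "node2": 32})

# Example 2, from node 0's perspective: cross-socket traffic is capped
# by UPI (4 links x 41.6 GB/s ~= 166 GB/s); split that budget between
# node1 and node3 in proportion to their native bandwidth (assumption).
upi = 4 * 41.6
ex2 = interleave_weights({
    "node0": 307,
    "node1": upi * 307 / (307 + 64),
    "node2": 64,
    "node3": upi * 64 / (307 + 64),
})
```

Running this yields 76/16/8 for example 1 and 57/26/12/5 for example 2, matching the figures in the thread.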