From: Gregory Price <gregory.price@memverge.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, akpm@linux-foundation.org,
sthanneeru@micron.com,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
Wei Xu <weixugc@google.com>, Alistair Popple <apopple@nvidia.com>,
Dan Williams <dan.j.williams@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Michal Hocko <mhocko@kernel.org>, Tim Chen <tim.c.chen@intel.com>,
Yang Shi <shy828301@gmail.com>
Subject: Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
Date: Thu, 19 Oct 2023 09:26:15 -0400
Message-ID: <ZTEud5K5T+dRQMiM@memverge.com>
In-Reply-To: <87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Fri, Oct 20, 2023 at 02:11:40PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
>
> >
[...snip...]
> > Example 2: A dual-socket system with 1 CXL device per socket
> > ===
> > CPU Nodes: node0, node1
> > CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
> >
[...snip...]
> > This is similar to example #1, but with one difference: A task running
> > on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
[...snip...]
> > This leaves us with weights of:
> >
> > node0 - 57%
> > node1 - 26%
> > node2 - 12%
> > node3 - 5%
> >
>
> Does the workload run only on the CPUs of node 0? This appears unreasonable.
It depends. If a user explicitly launches with `numactl --cpunodebind=0`,
then yes, you can force a task (and all its children) to run on node0.

If a workload is multi-threaded enough to run on both sockets, then you
are right that you'd want to limit cross-socket traffic by binding
individual threads to nodes that don't cross sockets - if that is
feasible at all (it may not be).
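As an illustration, here is a minimal sketch of that kind of per-thread
binding using libnuma (the helper name is hypothetical; node numbers are
taken from example 2 above):

    #include <numa.h>  /* libnuma: numa_available(), numa_run_on_node() */

    /* Pin the calling thread to the CPUs of node 0 so that all of its
     * traffic stays on socket 0 (DRAM node0 + CXL node2 in example 2). */
    static int bind_thread_to_socket0(void)
    {
        if (numa_available() < 0)
            return -1;  /* no NUMA support on this system */
        return numa_run_on_node(0);
    }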
But at that point, we're getting into the area of NUMA-aware software.
That's a bit beyond the scope of this series - which is to enable a
coarse-grained interleaving solution that can easily be accessed with
something like `numactl --interleave` or the proposed
`numactl --weighted-interleave`.
> If the memory bandwidth requirement of the workload is so large that CXL
> is used to expand bandwidth, why not run the workload on the CPUs of node
> 1 and use the full memory bandwidth of node 1?
Settings are NOT one-size-fits-all. You can certainly come up with
another scenario in which these weights are not optimal.
If we're running enough threads that we need multiple sockets to run
them concurrently, then the memory distribution weights become much more
complex. Without more precise control over task placement, and without
preventing task migration, you can't really get an "optimal" placement.
What I'm really saying is "task placement is a more powerful function
for predicting performance than memory placement". However, user
software would need to implement a pseudo-scheduler and explicit data
placement to be fully optimized. Beyond this, there is only so much we
can do from a `numactl` perspective.
tl;dr: We can't get a perfect system here, because finding the best case
for all possible scenarios is probably an intractable problem. You will
always be able to generate an example wherein the system is not optimal.
>
> If the workload runs on the CPUs of node 0 and node 1, then cross-socket
> traffic should be minimized if possible. That is, threads/processes on
> node 0 should interleave memory of node 0 and node 2, while those on node
> 1 should interleave memory of node 1 and node 3.
This can be done with set_mempolicy(): use MPOL_INTERLEAVE and set the
nodemask to what you describe. Those tasks also need to prevent
themselves from being migrated, but this can absolutely be done.
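As a concrete sketch (using the libnuma wrappers, link with -lnuma;
node numbers are from example 2, error handling mostly elided):

    #include <numa.h>    /* numa_run_on_node() */
    #include <numaif.h>  /* set_mempolicy(), MPOL_INTERLEAVE */

    /* A thread meant for socket 0: stay on node 0's CPUs, and
     * interleave allocations across node0 (DRAM) and node2 (CXL). */
    static int interleave_on_socket0(void)
    {
        unsigned long mask = (1UL << 0) | (1UL << 2);

        if (numa_run_on_node(0) < 0)  /* avoid cross-socket migration */
            return -1;
        return set_mempolicy(MPOL_INTERLEAVE, &mask, 8 * sizeof(mask));
    }

A socket-1 thread would do the same with node 1 and a mask of nodes 1
and 3.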
In this scenario, the weights need to be re-calculated based on the
bandwidth of the nodes in the mempolicy nodemask, which is what I
described in the last email.
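For example, with the weights from example 2 above, a socket-0 task
interleaving across only nodes 0 and 2 would renormalize against just
those two nodes' shares - roughly:

    node0: 57 / (57 + 12) ~= 83%
    node2: 12 / (57 + 12) ~= 17%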
~Gregory