From: "Huang, Ying" <ying.huang@intel.com>
To: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Cc: <akpm@linux-foundation.org>, <aneesh.kumar@linux.ibm.com>,
<apopple@nvidia.com>, <dave.hansen@intel.com>,
<gourry.memverge@gmail.com>, <gregkh@linuxfoundation.org>,
<gregory.price@memverge.com>, <hannes@cmpxchg.org>,
<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <mhocko@suse.com>, <rafael@kernel.org>,
<shy828301@gmail.com>, <tim.c.chen@intel.com>,
<weixugc@google.com>
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Date: Thu, 02 Nov 2023 14:41:03 +0800
Message-ID: <87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20231101092923.283-1-ravis.opensrc@micron.com> (Ravi Jonnalagadda's message of "Wed, 1 Nov 2023 14:59:23 +0530")
Ravi Jonnalagadda <ravis.opensrc@micron.com> writes:
>>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>>>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>>>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>>>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>>
>>[snip]
>>
>>>>
>>>> > This hopefully also explains why it's a global setting. The usecase is
>>>> > different from conventional NUMA interleaving, which is used as a
>>>> > locality measure: spread shared data evenly between compute
>>>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>>>> > compute. Instead, the optimal spread is based on hardware parameters,
>>>> > which is a global property rather than a per-workload one.
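The spread described above can be pictured as a weighted round-robin over the nodemask. A minimal sketch, with made-up weights rather than the patchset's actual implementation:

```python
from itertools import islice

def weighted_interleave(weights):
    """Yield node ids in a repeating pattern honoring the weights."""
    while True:
        for node, w in sorted(weights.items()):
            for _ in range(w):
                yield node

# With hypothetical weights {0: 3, 1: 1}, three pages land on node 0
# for every one page on node 1.
order = list(islice(weighted_interleave({0: 3, 1: 1}), 8))
print(order)  # [0, 0, 0, 1, 0, 0, 0, 1]
```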
>>>>
>>>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>>>> for this specific CXL usecase but it just doesn't fit into many others I
>>>> can think of - e.g. proportional use of those tiers based on the
>>>> workload - you get what you pay for.
>>>>
>>>> Is there any specific reason for not having a new interleave interface
>>>> which defines weights for the nodemask? Is this because the policy
>>>> itself is very dynamic or is this more driven by simplicity of use?
>>>
>>> A downside of *requiring* weights to be paired with the mempolicy is
>>> that it's then the application that would have to figure out the
>>> weights dynamically, instead of having a static host configuration. A
>>> policy of "I want to be spread for optimal bus bandwidth" translates
>>> between different hardware configurations, but optimal weights will
>>> vary depending on the type of machine a job runs on.
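One way to picture how weights follow from hardware parameters: derive the smallest integer weights whose ratios match the per-node bandwidths. The bandwidth figures below are illustrative, not from any real system:

```python
from functools import reduce
from math import gcd

def weights_from_bandwidth(bw_gbps):
    """Map {node: bandwidth} to the smallest integer weights with the
    same ratios, e.g. {0: 240, 1: 60} -> {0: 4, 1: 1}."""
    g = reduce(gcd, bw_gbps.values())
    return {node: bw // g for node, bw in bw_gbps.items()}

# Hypothetical DRAM on node 0, CXL memory on node 1.
print(weights_from_bandwidth({0: 240, 1: 60}))  # {0: 4, 1: 1}
```

A different machine with different bandwidths would yield different weights from the same policy, which is the point being made above.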
>>>
>>> That doesn't mean there couldn't be usecases for having weights as
>>> policy as well in other scenarios, like you allude to above. It's just
>>> so far such usecases haven't really materialized or spelled out
>>> concretely. Maybe we just want both - a global default, and the
>>> ability to override it locally.
>>
>>I think that this is a good idea. A system-wide configuration with
>>reasonable defaults makes applications' lives much easier. If more
>>control is needed, some kind of workload-specific configuration can be
>>added.
>
> Glad that we are in agreement here. For the bandwidth expansion use
> cases that this interleave patchset is trying to cater to, most
> applications would follow the "reasonable defaults" for weights.
> Applications would mostly need to choose different weights when
> interleaving for capacity expansion, which the default memory tiering
> implementation would support anyway, with better latency.
>
>>And, instead of adding another memory policy, a cgroup-wise
>>configuration may be easier to use. The per-workload weights may need
>>to be adjusted when deploying different combinations of workloads on
>>the system.
>>
>>Another question is whether the weight should be per-memory-tier or
>>per-node. In this patchset, the weight is per source/target node
>>combination; that is, the weight becomes a matrix instead of a vector.
>>IIUC, this is used to control cross-socket memory access in addition to
>>per-memory-type memory access. Do you think the added complexity is
>>necessary?
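To illustrate the vector-versus-matrix distinction with made-up node numbers and weights (not the patchset's data structures):

```python
# Vector: every CPU sees the same weight for a given memory node.
vector_weights = {0: 4, 1: 1}          # target node -> weight

# Matrix: each source (CPU) node has its own row of target weights,
# e.g. to favor the CXL device attached to the local socket.
matrix_weights = {
    0: {0: 4, 1: 1, 2: 2, 3: 1},       # tasks running on node 0
    1: {0: 1, 1: 4, 2: 1, 3: 2},       # tasks running on node 1
}

def weight(src, dst):
    """Look up the interleave weight for allocations from src to dst."""
    return matrix_weights[src][dst]

# The same target node 2 gets different weights per source socket.
print(weight(0, 2), weight(1, 2))  # 2 1
```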
>
> Pros and Cons of Node based interleave:
> Pros:
> 1. Weights can be defined for devices with different bandwidth and latency
> characteristics individually irrespective of which tier they fall into.
> 2. Defining the weight per source/target node would be necessary for
> multi-socket systems where some devices may be closer to one socket than
> to the other.
> Cons:
> 1. Weights need to be programmed for all the nodes, which can be tedious
> for systems with a lot of NUMA nodes.
2. It is more complex, so it needs justification, for example, a practical use case.
> Pros and Cons of Memory Tier based interleave:
> Pros:
> 1. Programming a weight per initiator would apply to all the nodes in the tier.
> 2. Weights can be calculated considering the cumulative bandwidth of all
> the nodes in the tier and need to be programmed only once for all the
> nodes in a given tier.
> 3. It may be useful in cases where the number of NUMA nodes with similar
> latency and bandwidth characteristics increases, possibly with pooling
> use cases.
4. It is simpler.
> Cons:
> 1. If nodes with different bandwidth and latency characteristics are placed
> in the same tier, as seen in the current mainline kernel, it will be
> difficult to apply a correct interleave weight policy.
> 2. There will be a need for functionality to move nodes between tiers, or
> to create new tiers to place such nodes in, for programming correct
> interleave weights. We are currently working on a patch to support this.
Thanks! If we have such a system, we will need this.
> 3. For systems where each NUMA node has different characteristics, a
> single node might end up existing in different memory tiers, which would
> be equivalent to node-based interleaving.
No. A node can only exist in one memory tier.
> On newer systems, where all CXL memory from different devices under a
> port is combined to form a single NUMA node, this scenario might be
> applicable.
You mean the different memory ranges of a NUMA node may have different
performance? I don't think that we can deal with this.
> 4. Users may need to keep track of the different memory tiers and which
> nodes are present in each tier when invoking an interleave policy.
I don't think this is a con. With a node-based solution, you need to know
your system too.
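A sketch of the tier-based bookkeeping discussed above: one weight per memory tier, derived from the cumulative bandwidth of the tier's nodes and inherited by every node in the tier. Tier names, node ids and bandwidths here are hypothetical:

```python
from functools import reduce
from math import gcd

tiers = {
    "dram": {"nodes": [0, 1], "bw_gbps": [240, 240]},
    "cxl":  {"nodes": [2, 3], "bw_gbps": [60, 60]},
}

def per_node_weights(tiers):
    """Reduce each tier's cumulative bandwidth to an integer weight and
    apply it to all nodes in that tier."""
    total = {name: sum(t["bw_gbps"]) for name, t in tiers.items()}
    g = reduce(gcd, total.values())
    return {node: total[name] // g
            for name, t in tiers.items() for node in t["nodes"]}

print(per_node_weights(tiers))  # {0: 4, 1: 4, 2: 1, 3: 1}
```

Note how this cannot distinguish between nodes inside one tier, which is the first con listed above.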
>>
>>> Could you elaborate on the 'get what you pay for' usecase you
>>> mentioned?
>>
--
Best Regards,
Huang, Ying