From: Michal Hocko <mhocko@suse.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	 linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-mm@kvack.org,  ying.huang@intel.com,
	akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
	 weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
	dave.hansen@intel.com,  shy828301@gmail.com,
	gregkh@linuxfoundation.org, rafael@kernel.org,
	 Gregory Price <gregory.price@memverge.com>
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Date: Tue, 31 Oct 2023 16:56:27 +0100
Message-ID: <jgh5b5bm73qe7m3qmnsjo3drazgfaix3ycqmom5u6tfp6hcerj@ij4vftrutvrt>
In-Reply-To: <20231031152142.GA3029315@cmpxchg.org>

On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > This patchset implements weighted interleave and adds a new sysfs
> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > > 
> > > The il_weight of a node is used by mempolicy to implement weighted
> > > interleave when `numactl --interleave=...` is invoked.  By default
> > > il_weight for a node is 1, which preserves the existing round-robin
> > > interleave behavior.
> > > 
> > > Interleave weights may be set from 0-100, and denote the number of
> > > pages that should be allocated from the node when interleaving
> > > occurs.
> > > 
> > > For example, if a node's interleave weight is set to 5, 5 pages
> > > will be allocated from that node before the next node is scheduled
> > > for allocations.
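
As a purely illustrative sketch of driving that knob from userspace
(only the sysfs path comes from the patchset; the node/access numbers
and the weight value below are made up):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* set the interleave weight of node 2, access class 0, to 5 */
        const char *path = "/sys/devices/system/node/node2/access0/il_weight";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return EXIT_FAILURE;
        }
        /* with weight 5, mempolicy would hand out 5 pages from node 2
         * before moving on to the next node in the interleave set */
        fprintf(f, "5\n");
        return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
}
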
> > 
> > I find this semantic rather weird TBH. First of all, why do you think it
> > makes sense to have those weights global for all users? What if
> > different applications have different views on how to spread their
> > interleaved memory?
> > 
> > I do get that you might have different tiers with largely different
> > runtime characteristics, but why would you want to interleave them into
> > a single mapping and have hard-to-predict runtime behavior?
> > 
> > [...]
> > > In this way it becomes possible to set an interleaving strategy
> > > that fits the available bandwidth of the devices present on the
> > > system. An example system:
> > > 
> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 2 - CXL Memory, 64GB/s BW, on Node 0 root complex
> > > Node 3 - CXL Memory, 64GB/s BW, on Node 1 root complex
> > > 
> > > In this setup, the effective weights for nodes 0-3 for a task
> > > running on Node 0 may be [60, 20, 10, 10].
> > > 
> > > This spreads memory out across devices, which all have different
> > > latency and bandwidth attributes, in a way that can maximize the
> > > available resources.
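
One plausible way to arrive at numbers of that shape - just a sketch,
not something the patchset computes, and the scaling of the bandwidth
figures into the 0-100 weight range is an assumption - is to normalise
the per-node bandwidth as seen from node 0:

#include <stdio.h>

int main(void)
{
        /* bandwidth reachable from node 0, per the example above:
         * local DRAM, cross-socket DRAM, and the two CXL nodes */
        unsigned int bw[] = { 400, 200, 64, 64 };
        unsigned int total = 0;
        unsigned int i;

        for (i = 0; i < 4; i++)
                total += bw[i];

        /* prints 54/27/8/8 - the same ballpark as [60, 20, 10, 10] */
        for (i = 0; i < 4; i++)
                printf("node %u: weight %u\n", i, bw[i] * 100 / total);

        return 0;
}
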
> > 
> > OK, so why is this any better than not using any memory policy and
> > relying on demotion to push cold memory down the tier hierarchy?
> > 
> > What is the actual real life usecase and what kind of benefits can
> > you present?
> 
> There are two things CXL gives you: additional capacity and additional
> bus bandwidth.
> 
> The promotion/demotion mechanism is good for the capacity usecase,
> where you have a nice hot/cold gradient in the workingset and want
> placement accordingly across faster and slower memory.
> 
> The interleaving is useful when you have a flatter workingset
> distribution and poorer access locality. In that case, the CPU caches
> are less effective and the workload can be bus-bound. The workload
> might fit entirely into DRAM, but concentrating it there is
> suboptimal. Fanning it out in proportion to the relative performance
> of each memory tier gives better results.
> 
> We experimented with datacenter workloads on such machines last year
> and found significant performance benefits:
> 
> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Thanks, this is a useful insight.
 
> This hopefully also explains why it's a global setting. The usecase is
> different from conventional NUMA interleaving, which is used as a
> locality measure: spread shared data evenly between compute
> nodes. This one isn't about locality - the CXL tier doesn't have local
> compute. Instead, the optimal spread is based on hardware parameters,
> which is a global property rather than a per-workload one.

Well, I am not convinced about that TBH. Sure, it is probably a good fit
for this specific CXL usecase, but it just doesn't fit many others I
can think of - e.g. proportional use of those tiers based on the
workload - you get what you pay for.

Is there any specific reason for not having a new interleave interface
which defines weights for the nodemask? Is this because the policy
itself is very dynamic, or is this more driven by simplicity of use?
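
To make the question concrete, something along these lines - entirely
hypothetical, the set_weighted_mempolicy() call and its weights argument
are made up purely to sketch the shape of a per-task interface:

/* Hypothetical sketch only: the weights travel with the nodemask
 * instead of living in a global sysfs knob.  Declaration only - no
 * such call exists, so this compiles but will not link. */
#include <numaif.h>     /* MPOL_INTERLEAVE, real header from libnuma */

long set_weighted_mempolicy(int mode, const unsigned long *nodemask,
                            unsigned long maxnode,
                            const unsigned char *weights);

static void example(void)
{
        unsigned long nodemask = 0xf;                   /* nodes 0-3 */
        unsigned char weights[] = { 60, 20, 10, 10 };   /* per-task choice */

        set_weighted_mempolicy(MPOL_INTERLEAVE, &nodemask,
                               sizeof(nodemask) * 8, weights);
}
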

-- 
Michal Hocko
SUSE Labs

