From: Michal Hocko <mhocko@suse.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	 Gregory Price <gourry.memverge@gmail.com>,
	linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	 linux-mm@kvack.org, ying.huang@intel.com,
	akpm@linux-foundation.org,  aneesh.kumar@linux.ibm.com,
	weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
	 dave.hansen@intel.com, shy828301@gmail.com,
	gregkh@linuxfoundation.org,  rafael@kernel.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Date: Fri, 3 Nov 2023 10:56:01 +0100
Message-ID: <s72oio7nmez565i7h6fb4bdnhqkcablr34rz5gqteyrrf7yeux@lqrztvy35si5>
In-Reply-To: <ZUMVI4YG7mB54u0D@memverge.com>

On Wed 01-11-23 23:18:59, Gregory Price wrote:
> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
> > On Wed 01-11-23 12:58:55, Gregory Price wrote:
> > > Basically consider: `numactl --interleave=all ...`
> > > 
> > > If `--weights=...`: when a node hotplug event occurs, there is no
> > > recourse for adding a weight for the new node (it will default to 1).
> > 
> > Correct, and this is what I was asking about in an earlier email. How
> > much do we really need to consider this setup? Is this something nice to
> > have, or does the nature of the technology require it to be fully dynamic
> > and expect new nodes coming up at any moment?
> >  
> 
> Dynamic Capacity is expected to cause a numa node to change size (in
> number of memory blocks) rather than cause numa nodes to come and go, so
> maybe handling the full node hotplug is a bit of an overreach.
> 
> Good call, I'll stop considering this problem for now.
> 
> > > If the node is removed from the system, I believe (need to validate
> > > this, but IIRC) the node will be removed from any registered cpusets.
> > > As a result, that falls down to mempolicy, and the node is removed.
> > 
> > I do not think we do anything like that. Userspace might decide to
> > change the numa mask when a node is offlined but I do not think we do
> > anything like that automagically.
> >
> 
> mpol_rebind_policy called by update_tasks_nodemask
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
> 
> falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771

Ohh, I have missed that. Thanks for the reference. Quite honestly I am not
sure this code is really a) necessary and b) ever exercised. For the
former I would argue that an offline node could be treated as a completely
depleted one. From the correctness POV it shouldn't make any difference
and I am rather skeptical it would bring any performance improvements. And
for the latter, full node offlines are really rare in my experience. I
would be interested in actual real-life use cases which do that
regularly. I do remember a certain HW vendor working on a hotpluggable
system (both CPUs and memory) to reduce downtimes caused by misbehaving
CPUs/memory. This turned out to be very impractical because of movable
memory requirements and also some HW limitations (like most HW being
attached to Node0, which turned out to be a single point of failure anyway).
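
For reference, the path Gregory points to boils down to roughly the
following (a heavily simplified sketch from my reading of
kernel/cgroup/cpuset.c and mm/mempolicy.c; the real code carries much
more locking and many more corner cases than shown here):

/*
 * cpuset_hotplug_workfn() notices the memory hotplug event, recomputes
 * the cpuset's effective mems and then has update_tasks_nodemask() walk
 * the attached tasks and rebind their mempolicies.
 */
static void cpuset_memory_hotplug_sketch(struct cpuset *cs,
					 const nodemask_t *new_mems)
{
	struct css_task_iter it;
	struct task_struct *task;

	cs->effective_mems = *new_mems;

	css_task_iter_start(&cs->css, 0, &it);
	while ((task = css_task_iter_next(&it)))
		/* mpol_rebind_task() -> mpol_rebind_policy(): the task's
		 * interleave/bind nodemask is rewritten, dropping nodes
		 * that are no longer available */
		mpol_rebind_task(task, new_mems);
	css_task_iter_end(&it);
}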

[...]
> > Moving the global policy to cgroups would make the main concern of
> > different workloads looking for different policies less problematic.
> > I didn't have much time to think that through, but the main question is
> > how to sanely define hierarchical properties of those weights. This is
> > more of a resource distribution than enforcement, so maybe a simple
> > inherit or overwrite (if you have more specific needs) semantic makes
> > sense and is sufficient.
> >
> 
> As a user I would assume it would operate much the same way as other
> nested cgroups, which is to inherit by default (with subsets) or to use an
> explicit overwrite that can't exceed the higher-level settings.

This would make it rather impractical because a default (everything set
to 1) would be cast in stone. As mentioned above, this is not an
enforcement limit. So I _think_ that a simple hierarchical rule like
	cgroup_interleaving_mask(cgroup)
		interleaving_mask = (cgroup->interleaving_mask) ? : cgroup_interleaving_mask(parent_cgroup(cgroup))

would be sufficient, i.e. child cgroups could override the parent as they
wish. If there is any enforcement (like a cpuset), that would filter the
usable nodes and the allocation policy would simply apply the weights to
those.
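
To make that more concrete, a rough C sketch of the lookup I have in mind
(the interleave_weights pointer and the root default below are made up
purely for illustration; nothing like them exists in the tree today):

/*
 * Hypothetical per-cgroup interleave weights with inherit-by-default
 * semantics: an unset (NULL) pointer means "use whatever the parent
 * uses"; only an explicit write pins a cgroup's own weights.
 */
static struct interleave_weights *
cgroup_interleave_weights(struct cgroup *cgrp)
{
	for (; cgrp; cgrp = cgroup_parent(cgrp))
		if (cgrp->interleave_weights)	/* explicit overwrite */
			return cgrp->interleave_weights;

	return &default_interleave_weights;	/* root default, all weights 1 */
}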

-- 
Michal Hocko
SUSE Labs

