From: Gregory Price <gregory.price@memverge.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Gregory Price <gourry.memverge@gmail.com>,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
linux-mm@kvack.org, ying.huang@intel.com,
akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
dave.hansen@intel.com, shy828301@gmail.com,
gregkh@linuxfoundation.org, rafael@kernel.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Date: Wed, 1 Nov 2023 12:58:55 -0400
Message-ID: <ZUKDz5NpMsoyzWtZ@memverge.com>
In-Reply-To: <pmxrljwp4ayl3fcu7rxm6prbumgb5l3lwb75lqfipmxxxwnqfo@nb5qjcxw22gp>
On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote:
> On Tue 31-10-23 00:27:04, Gregory Price wrote:
[... snip ...]
> >
> > The downside of doing it in mempolicy is...
> > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
> > non-trivial task. It is very "current-task" centric.
>
> True. Cpusets is the way to make it less process centric but that comes
> with its own constraints (namely which NUMA policies are supported).
>
> > 2) Barring a change to mempolicy to be sysfs friendly, the options for
> > implementing weights in the mempolicy are either a) new flag and
> > setting every weight individually in many syscalls, or b) a new
> > syscall (set_mempolicy2), which is what I demonstrated in the RFC.
>
> Yes, that would likely require a new syscall.
>
> > 3) mempolicy is also subject to cgroup nodemasks, and as a result you
> > end up with a rat's nest of interactions between mempolicy nodemasks
> > changing as a result of cgroup migrations, nodes potentially coming
> > and going (hotplug under CXL), and others I'm probably forgetting.
>
> Is this really any different from what you are proposing though?
>
In only one manner: An external user can set the weight of a node that
is added later on. If it is implemented in mempolicy, then this is not
possible.
Basically consider: `numactl --interleave=all ...`
If `--weights=...` is supplied: when a node hotplug event occurs, there
is no recourse for setting a weight on the new node (it will default
to 1).
Maybe the answer is "Best effort, sorry" and we don't handle that
situation. That doesn't seem entirely unreasonable.
At least with weights in the node (or cgroup, or memtier, whatever),
the weight can be set outside the mempolicy context.
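As a rough illustration of why an external table helps (names here are
hypothetical, not from the patch): a global per-node weight array
survives the hotplug event, so the new node can be re-weighted after
the fact:

    /* Hypothetical: a global per-node weight table, defaulting to 1. */
    static unsigned char node_il_weight[MAX_NUMNODES] = {
            [0 ... MAX_NUMNODES - 1] = 1
    };

    /* A hotplug handler or sysfs store can update the weight after the
     * fact, without touching any task's mempolicy. */
    static void set_il_weight(int nid, unsigned char weight)
    {
            node_il_weight[nid] = weight;
    }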
> > weight, or should you reset it? If a new node comes into the node
> > mask... what weight should you set? I did not have answers to these
> > questions.
>
> I am not really sure I follow you. Are you talking about cpuset
> nodemask changes or memory hotplug here?
>
Actually both, in slightly different contexts.
If the weights are implemented in mempolicy and the cpuset nodemask
changes, then the mempolicy nodemask changes with it.
If the node is removed from the system, I believe (need to validate
this, but IIRC) the node will be removed from any registered cpusets.
As a result, that falls down to mempolicy, and the node is removed.
Not entirely sure what happens if a node is added. The only case where
I think that is relevant is when cpuset is empty ("all") and mempolicy
is set to something like `--interleave=all`. In this case, it's
possible that the new node will simply get the default weight (1), and
if weights are implemented only in mempolicy, there is no recourse for
changing it.
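A minimal sketch of that fallback (illustrative only, not the patch's
actual code):

    /*
     * A node that shows up in the effective nodemask without an
     * explicit weight falls back to 1, which degenerates to plain
     * round-robin interleave for that node.
     */
    weight = pol->weights[nid] ? pol->weights[nid] : 1;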
> > It was recommended to explore placing it in tiers instead, so I took a
> > crack at it here:
> >
> > https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
> >
> > This had similar issue with the idea of hotplug nodes: if you give a
> > tier a weight, and one or more of the nodes goes away/comes back... what
> > should you do with the weight? Split it up among the remaining nodes?
> > Rebalance? Etc.
>
> How is this any different from a node becoming depleted? You cannot
> really expect that you get the memory you are asking for, and you can
> easily end up getting memory from a different node instead.
>
... snip ...
> Maybe I am missing something really crucial here but I do not see how
> this fundamentally changes anything.
>
> Memory hotremove
... snip ...
> Memory hotadd
... snip ...
> But that requires that the interleave policy nodemask assumes future
> nodes coming online and puts them into the mask.
>
The difference is how the nodemask changes in mempolicy and cpuset. If
a node is removed entirely from the nodemask and then comes back
(through cpuset or something), then what do you do with it?
If memory is depleted but opens up later, the interleave policy starts
working as intended again. If a node disappears and comes back... that
plumbing is a bit more complex.
So yes, "assuming future nodes coming online and putting them into the
mask" is the concern I have. A node being added/removed from the
nodemask presents different plumbing issues than simple depletion.
If that's really not a concern and we're happy to just let it be OBO
until an actual use case for handling node hotplug for weighting comes
along, then mempolicy-based weighting alone seems more than sufficient.
> > I am not against implementing it in mempolicy (as proof: my first RFC).
> > I am simply searching for the acceptable way to implement it.
> >
> > One of the benefits of having it set as a global setting is that weights
> > can be automatically generated from HMAT/HMEM information (ACPI tables)
> > and programs already using MPOL_INTERLEAVE will have a direct benefit.
>
> Right. This is understood. My main concern is whether this outweighs
> the limitations of having a _global_ policy _only_. Historically a
> single global policy has usually led to finding ways to make it more
> scoped (usually through cgroups).
>
Maybe the answer here is to put it in cgroups + mempolicy, and not
handle hotplug? That would be an easy shift of this patch to cgroups,
plus pulling my syscall patch forward to add weights directly to
mempolicy.
I think the interleave code stays pretty much the same; the only
difference would be where the task gets the weight from:
    if (pol->mode == WEIGHTED_INTERLEAVE)
            weight = pol->weight[target_node];
    else
            weight = cgroups.get_weight(from_node, target_node);
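Fleshed out slightly (still just a sketch; the helper name and the
cgroup accessor are made up):

    /*
     * Sketch only: task-local weights win when the task opted in via
     * a weighted mempolicy; otherwise fall back to the external
     * (cgroup/node) table.
     */
    static unsigned char il_weight(struct mempolicy *pol,
                                   int from_node, int target_node)
    {
            if (pol->mode == WEIGHTED_INTERLEAVE)
                    return pol->weight[target_node];
            return cgroups_get_weight(from_node, target_node);
    }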
~Gregory