linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	 Gregory Price <gourry.memverge@gmail.com>,
	linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	 linux-mm@kvack.org, ying.huang@intel.com,
	akpm@linux-foundation.org,  aneesh.kumar@linux.ibm.com,
	weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
	 dave.hansen@intel.com, shy828301@gmail.com,
	gregkh@linuxfoundation.org,  rafael@kernel.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Date: Wed, 1 Nov 2023 14:45:50 +0100
Message-ID: <pmxrljwp4ayl3fcu7rxm6prbumgb5l3lwb75lqfipmxxxwnqfo@nb5qjcxw22gp>
In-Reply-To: <ZUCCGJgrqqk87aGN@memverge.com>

On Tue 31-10-23 00:27:04, Gregory Price wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> 
> > > This hopefully also explains why it's a global setting. The usecase is
> > > different from conventional NUMA interleaving, which is used as a
> > > locality measure: spread shared data evenly between compute
> > > nodes. This one isn't about locality - the CXL tier doesn't have local
> > > compute. Instead, the optimal spread is based on hardware parameters,
> > > which is a global property rather than a per-workload one.
> > 
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> > 
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> > 
> 
> I had originally implemented it this way while experimenting with new
> mempolicies.
> 
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/
> 
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task.  It is very "current-task" centric.

True. Cpusets are the way to make it less process centric, but that comes
with its own constraints (namely which NUMA policies are supported).
 
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.

Yes, that would likely require a new syscall.
 
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rat's nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.

Is this really any different from what you are proposing though?

>    Basically:  If a node leaves the nodemask, should you retain the
>    weight, or should you reset it? If a new node comes into the node
>    mask... what weight should you set? I did not have answers to these
>    questions.

I am not really sure I follow you. Are you talking about cpuset
nodemask changes or memory hotplug here?

> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here: 
> 
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
> 
> This had a similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight?  Split it up among the remaining nodes?
> Rebalance? Etc.

How is this any different from a node becoming depleted? You cannot
really expect to get the memory you are asking for, and you can easily
end up getting memory from a different node instead.
 
> The result of this discussion led us to simply say "What if we place
> the weights directly in the node".  And that led us to this RFC.

Maybe I am missing something really crucial here but I do not see how
this fundamentally changes anything.

Memory hotremove (or mere node memory depletion) is not really a problem
because interleaving is a best effort operation, so you have to live with
memory not being strictly distributed per your preferences.
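
To make the best-effort point concrete, here is a minimal user-space
sketch in Python. It is a toy model, not kernel code and not the RFC's
implementation. The function name, the weight table and the fallback
order (lowest node id first) are all assumptions made for illustration.

```python
from itertools import cycle


def weighted_interleave(weights, free_pages, n_allocs):
    """Place n_allocs pages by weight, falling back when a node has no memory.

    weights:    {node_id: positive int}, hypothetical per-node weights
    free_pages: {node_id: pages available}; mutated as pages are placed.
                A node missing from this dict models a hot-removed node.
    Returns {node_id: pages placed}.
    """
    # One interleave round: a node with weight 3 appears three times.
    order = [node for node, w in sorted(weights.items()) for _ in range(w)]
    if not order:
        raise ValueError("at least one node needs a positive weight")

    placed = {node: 0 for node in weights}
    targets = cycle(order)
    for _ in range(n_allocs):
        preferred = next(targets)
        # Best effort: preferred node first, then any other node (assumed order).
        candidates = [preferred] + [n for n in sorted(weights) if n != preferred]
        for node in candidates:
            if free_pages.get(node, 0) > 0:
                free_pages[node] -= 1
                placed[node] += 1
                break
        else:
            raise MemoryError("all nodes depleted")
    return placed
```

With weights {0: 3, 1: 1} and plenty of memory, 8 pages split 6:2. If
node 1 has a single free page, or is gone entirely, the remaining pages
simply land on node 0; the weights themselves never need to change.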

Memory hotadd will be easier to manage because you just update a single
place after a node is hotadded rather than gazillions of partial policies.
But that requires the interleave policy nodemask to anticipate nodes
coming online in the future and to include them in the mask.
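
A rough sketch of that hotadd case, again as a toy Python model rather
than anything from the patches. The names node_weights, online_nodes,
InterleavePolicy and hotadd are hypothetical. The assumption is that a
policy stores only a nodemask and looks weights up in one global table
at use time.

```python
# Hypothetical global state: one weight per node, set in a single place.
node_weights = {0: 3, 1: 1}
online_nodes = {0, 1}


class InterleavePolicy:
    """A per-task policy that stores only a nodemask, never weights."""

    def __init__(self, nodemask):
        # The mask may name nodes that are not online yet.
        self.nodemask = set(nodemask)

    def effective_weights(self):
        # Looked up at use time, so a global update is visible immediately.
        usable = self.nodemask & online_nodes
        return {n: node_weights[n] for n in sorted(usable)}


def hotadd(node, weight):
    """Bring a node online and set its weight once, for all policies."""
    online_nodes.add(node)
    node_weights[node] = weight


anticipating = InterleavePolicy({0, 1, 2})  # mask already includes node 2
fixed = InterleavePolicy({0, 1})            # mask does not include node 2

hotadd(2, 2)
print(anticipating.effective_weights())
print(fixed.effective_weights())
```

Only the policy whose mask already named node 2 picks it up after the
hotadd; the other one keeps interleaving over nodes 0 and 1 until
somebody updates its mask.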

> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
> 
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.

Right. This is understood. My main concern is whether this outweighs
the limitations of having a _global_ policy _only_. Historically, a single
global policy has usually led to people looking for ways to make it more
scoped (usually through cgroups).
 
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> along side this patch so that MPOL_INTERLEAVE is left entirely alone.
> 
> Happy to discuss more,
> ~Gregory

-- 
Michal Hocko
SUSE Labs

