From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gregory.price@memverge.com>,
Michal Hocko <mhocko@suse.com>
Cc: "tj@kernel.org" <tj@kernel.org>,
John Groves <john@jagalactic.com>,
Gregory Price <gourry.memverge@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
"corbet@lwn.net" <corbet@lwn.net>,
"roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
"shakeelb@google.com" <shakeelb@google.com>,
"muchun.song@linux.dev" <muchun.song@linux.dev>,
"jgroves@micron.com" <jgroves@micron.com>
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
Date: Wed, 15 Nov 2023 13:56:53 +0800
Message-ID: <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <ZVOzMEtDYB4l8qFy@memverge.com> (Gregory Price's message of "Tue, 14 Nov 2023 12:49:36 -0500")
Gregory Price <gregory.price@memverge.com> writes:
> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup-based interface is a much
>> > > better choice than a global one. Cpusets seem to be a good fit, as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> >
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> >
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>>
>> Right, our current cpuset implementation indeed requires the callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottleneck though. If
>> anything, cpuset lock optimizations would be a win also for users who do
>> not want to use the weighted interleave interface.
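FWIW, the read side in question looks roughly like this (a simplified
sketch of kernel/cgroup/cpuset.c, not the exact code):

    /* One global lock serializes every mems_allowed read. */
    static DEFINE_SPINLOCK(callback_lock);

    bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
    {
            struct cpuset *cs;
            unsigned long flags;
            bool allowed;

            spin_lock_irqsave(&callback_lock, flags);  /* global */
            rcu_read_lock();
            cs = nearest_hardwall_ancestor(task_cs(current));
            allowed = node_isset(node, cs->mems_allowed);
            rcu_read_unlock();
            spin_unlock_irqrestore(&callback_lock, flags);
            return allowed;
    }

So every policy-constrained allocation from every task funnels through
the same spinlock, which is the contention concern above.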
>
> Definitely agree, but that's a rather large increase in scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until the cpuset lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications of the pull-model make me sad.
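The existing nodemask push path is already quite small; simplified, it
looks something like the sketch below (not the exact code), and a
hypothetical per-node weights push could follow the same shape:

    /*
     * Sketch of the existing push model in kernel/cgroup/cpuset.c:
     * when cpuset.mems changes, cpuset walks the affected tasks and
     * pushes the new nodemask down into each task's mempolicy.
     */
    static void cpuset_change_task_nodemask(struct task_struct *tsk,
                                            nodemask_t *newmems)
    {
            task_lock(tsk);
            /* ... merge/update tsk->mems_allowed ... */
            mpol_rebind_task(tsk, newmems); /* push into mempolicy */
            task_unlock(tsk);
    }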
>
> I'd like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc.) before I go mucking around
> further.
>
> So far we have:
> * mempolicy: updating weights is a very complicated undertaking,
>              and there is no (good) way to do this from outside the
>              task. It would be better to have coarser-grained control.
>
>              A new syscall is likely needed to add/set weights in the
>              per-task mempolicy, or we bite the bullet on set_mempolicy2
>              and make the syscall extensible for the future.
>
> * memtiers: tier=node when devices are already interleaved or when all
>             devices are different, so why add yet another layer of
>             complexity if other constructs already exist? Additionally,
>             you lose task-placement-relative weighting (or it becomes
>             very complex to implement).
Because we usually have multiple nodes in one mem-tier, I still think a
mem-tier-based interface is simpler than a node-based one. But it seems
more complex to introduce mem-tiers into mempolicy, especially if we
have per-task weights. So, I am fine with going with a node-based
interface.
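To make the node-based behavior concrete: on the allocation side,
weighted interleave can be a simple weighted round-robin over the
nodemask. A minimal userspace sketch (illustrative only, with made-up
3:1 weights; not the kernel implementation):

    #include <stdio.h>

    int main(void)
    {
            int nodes[]   = { 0, 1 }; /* nodemask: DRAM=0, CXL=1 */
            int weights[] = { 3, 1 }; /* per-node weights */
            int cur = 0, left = weights[0];

            for (int page = 0; page < 8; page++) {
                    printf("page %d -> node %d\n", page, nodes[cur]);
                    if (--left == 0) { /* weight used up: next node */
                            cur = (cur + 1) % 2;
                            left = weights[cur];
                    }
            }
            return 0;
    }

With weights 3:1 this places pages 0-2 on node 0, page 3 on node 1,
and so on, approximating the bandwidth ratio between the nodes.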
> * cgroups: "this doesn't involve dynamic resource accounting /
> enforcement at all" and "these aren't resource
> allocations, it's unclear what the hierarchical
> relationship mean".
>
> * node: too global; explore smaller scope first, then expand.
Why is it too global? I understand that it doesn't cover all possible
use cases (although I don't know whether those use cases are practical
or not). But it can provide reasonable default per-node weights based
on available node performance information (such as HMAT, CDAT, etc.).
And quite a few workloads could just use it. I think this is a useful
feature.
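For example, if HMAT reports ~240 GB/s for a DRAM node and ~60 GB/s
for a CXL node (numbers assumed purely for illustration), reducing the
bandwidth ratio by the GCD yields default weights of 4:1:

    #include <stdio.h>

    static unsigned int gcd(unsigned int a, unsigned int b)
    {
            while (b) {
                    unsigned int t = a % b;
                    a = b;
                    b = t;
            }
            return a;
    }

    int main(void)
    {
            /* Assumed bandwidths (GB/s), as HMAT/CDAT might report. */
            unsigned int bw[] = { 240, 60 }; /* node 0 DRAM, node 1 CXL */
            unsigned int g = gcd(bw[0], bw[1]);

            for (int n = 0; n < 2; n++)
                    printf("node %d: default weight %u\n", n, bw[n] / g);
            return 0; /* prints weights 4 and 1 */
    }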
> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so I'll
> go ahead and put up another RFC for some options on that in the next
> week or so.
>
> The limitation of the first pass will be that only the task itself is
> capable of re-weighting should cpuset.mems or the nodemask change.
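For the syscall option, one extensible shape would be a sized argument
struct in the style of clone3()/openat2(), so fields such as per-node
weights can be appended later without a new syscall number. Everything
below is hypothetical; names and layout are invented for illustration:

    /* Hypothetical, illustrative ABI sketch -- not an agreed design. */
    struct mpol_args {
            __u16 mode;          /* e.g. MPOL_INTERLEAVE */
            __u16 mode_flags;
            __u64 pol_nodes;     /* userspace pointer to nodemask */
            __u64 pol_maxnodes;
            __u64 il_weights;    /* pointer to per-node weights (new) */
    };

    long set_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long flags);

Passing the struct size lets the kernel accept both older and newer
callers as the struct grows.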
--
Best Regards,
Huang, Ying