Re: [RFC] proc: Add a new isolated /proc/pid/mempolicy type.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Abel Wu <wuyun.abel@bytedance.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>,
	corbet@lwn.net, akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [RFC] proc: Add a new isolated /proc/pid/mempolicy type.
Date: Wed, 28 Sep 2022 11:09:47 +0800	[thread overview]
Message-ID: <4e2aa5c2-3d8c-2a2f-691b-218e23e7271f@bytedance.com> (raw)
In-Reply-To: <YzMBnKUo8ny9S/7+@dhcp22.suse.cz>

On 9/27/22 9:58 PM, Michal Hocko wrote:
> On Tue 27-09-22 21:07:02, Abel Wu wrote:
>> On 9/27/22 6:49 PM, Michal Hocko wrote:
>>> On Tue 27-09-22 11:20:54, Abel Wu wrote:
>>> [...]
>>>>>> Btw.in order to add per-thread-group mempolicy, is it possible to add
>>>>>> mempolicy in mm_struct?
>>>>>
>>>>> I dunno. This would make the mempolicy interface even more confusing.
>>>>> Per mm behavior makes a lot of sense but we already do have per-thread
>>>>> semantic so I would stick to it rather than introducing a new semantic.
>>>>>
>>>>> Why is this really important?
>>>>
>>>> We want soft control on memory footprint of background jobs by applying
>>>> NUMA preferences when necessary, so the impact on different NUMA nodes
>>>> can be managed to some extent. These NUMA preferences are given by the
>>>> control panel, and it might not be suitable to overwrite the tasks with
>>>> specific memory policies already (or vice versa).
>>>
>>> Maybe the answer is somehow implicit but I do not really see any
>>> argument for the per thread-group semantic here. In other words why a
>>> new interface has to cover more than the local [sg]et_mempolicy?
>>> I can see convenience as one potential argument. Also if there is a
>>> requirement to change the policy in atomic way then this would require a
>>> single syscall.
>>
>> Convenience is not our major concern. A well-tuned workload can have
>> specific memory policies for different tasks/vmas in one process, and
>> this can be achieved by set_mempolicy()/mbind() respectively. While
>> other workloads are not, they don't care where the memory residents,
>> so the impact they brought on the co-located workloads might vary in
>> different NUMA nodes.
>>
>> The control panel, which has a full knowledge of workload profiling,
>> may want to interfere the behavior of the non-mempolicied processes
>> by giving them NUMA preferences, to better serve the co-located jobs.
>>
>> So in this scenario, a process's memory policy can be assigned by two
>> objects dynamically:
>>
>>   a) the process itself, through set_mempolicy()/mbind()
>>   b) the control panel, but API is not available right now
>>
>> Considering the two policies should not fight each other, it sounds
>> reasonable to introduce a new syscall to assign memory policy to a
>> process through struct mm_struct.
> 
> So you want to allow restoring the original local policy if the external
> one is disabled?

Pretty much, but the internal policies are expected to have precedence
over the external ones, since they are set for some reason to meet their
specific requirements. The external ones are used only when there is no
internal policy active.

> 
> Anyway, pidfd_$FOO behavior should be semantically very similar to the
> original $FOO. Moving from per-task to per-mm is a major shift in the
> semantic.  I can imagine to have a dedicated flag for the syscall to
> enforce the policy to the full thread group. But having a different
> semantic is both tricky and also constrained because per-thread binding
> is then impossible.

Agreed. What about a syscall only apply to per-mm? There are precedents
like process_madvice(2).

next prev parent reply	other threads:[~2022-09-28  3:09 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-26  9:10 hezhongkun
2022-09-26  9:56 ` Michal Hocko
2022-09-26 12:53   ` [External] " Zhongkun He
2022-09-26 14:08     ` Michal Hocko
2022-09-27  3:20       ` Abel Wu
2022-09-27 10:49         ` Michal Hocko
2022-09-27 13:07           ` [External] " Abel Wu
2022-09-27 13:58             ` Michal Hocko
2022-09-28  3:09               ` Abel Wu [this message]
2022-09-30  8:54                 ` Michal Hocko
2022-09-28 23:39       ` [External] " Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4e2aa5c2-3d8c-2a2f-691b-218e23e7271f@bytedance.com \
    --to=wuyun.abel@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=hezhongkun.hzk@bytedance.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox