Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

From: "程垲涛 Chengkaitao Cheng" <chengkaitao@didiglobal.com>
To: Michal Hocko <mhocko@suse.com>
Cc: "tj@kernel.org" <tj@kernel.org>,
	"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
	"shakeelb@google.com" <shakeelb@google.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"brauner@kernel.org" <brauner@kernel.org>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"zhengqi.arch@bytedance.com" <zhengqi.arch@bytedance.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>,
	"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
	"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
	"pilgrimtao@gmail.com" <pilgrimtao@gmail.com>,
	"haolee.swjtu@gmail.com" <haolee.swjtu@gmail.com>,
	"yuzhao@google.com" <yuzhao@google.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"vasily.averin@linux.dev" <vasily.averin@linux.dev>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"surenb@google.com" <surenb@google.com>,
	"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection
Date: Tue, 9 May 2023 06:50:59 +0000
Message-ID: <900EF82B-9899-46DD-9ACC-16D82D9B7A3F@didiglobal.com>
In-Reply-To: <ZFkEqhAs7FELUO3a@dhcp22.suse.cz>

At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote:
>On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote:
>> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> >
>> >That being said, make sure you describe your usecase more thoroughly.
>> >Please also make sure you describe the intended heuristic of the knob.
>> >It is not really clear from the description how this fits hierarchical
>> >behavior of cgroups. I would be especially interested in the semantics
>> >of non-leaf memcgs protection as they do not have any actual processes
>> >to protect.
>> >
>> >Also there have been concerns mentioned in v2 discussion and it would be
>> >really appreciated to summarize how you have dealt with them.
>> >
>> >Please also note that many people are going to be slow in responding
>> >this week because of LSFMM conference
>> >(https://events.linuxfoundation.org/lsfmm/)
>> 
>> Here is a more detailed comparison and introduction of the old oom_score_adj
>> mechanism and the new oom_protect mechanism,
>> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
>> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
>> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
>> 2. It may be simple to create a lightweight parent process and uniformly set the 
>> oom_score_adj of some important processes, but it is not a simple matter to make 
>> multi-level settings for tens of thousands of processes on the physical machine 
>> through the lightweight parent processes. We may need a huge table to record the 
>> value of oom_score_adj maintained by all lightweight parent processes, and the 
>> user process limited by the parent process has no ability to change its own 
>> oom_score_adj, because it does not know the details of the huge table. The new 
>> patch adopts the cgroup mechanism. It does not need any parent process to manage 
>> oom_score_adj. The settings of each memcg are independent of one another,
>> making it easier to plan the OOM order of all processes. Because of the unique
>> nature of memory resources, cloud service vendors currently do not oversell
>> memory in their capacity planning. I would like to use the new patch to make
>> overselling memory resources possible.
>
>OK, this is more specific about the usecase. Thanks! So essentially what
>it boils down to is that you are handling many containers (memcgs from
>our POV) and they have different priorities. You want to overcommit the
>memory to the extent that global ooms are not an unexpected event. Once
>that happens the total memory consumption of a specific memcg is less
>important than its "priority". You define that priority by the excess of
>the memory usage above a user defined threshold. Correct?

Yes, that is correct.
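
To put numbers on point 1 above, here is a minimal sketch of the arithmetic.
It assumes that one oom_score_adj unit corresponds to totalpages / 1000 pages
(which is how I understand oom_badness() to scale the adjustment), that swap is
ignored, and that pages are 4K. The names below are only for illustration and
are not part of the patch.

```c
#include <stdint.h>

/* Assumed page size: 4K, as in the comparison above. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Illustrative helper (not from the patch): memory covered by a single
 * oom_score_adj unit, assuming one unit equals totalpages / 1000 pages.
 */
static uint64_t oom_score_adj_step_bytes(uint64_t total_bytes)
{
	uint64_t totalpages = total_bytes / SKETCH_PAGE_SIZE;

	return (totalpages / 1000) * SKETCH_PAGE_SIZE;
}

/*
 * Illustrative helper (not from the patch): oom_protect is described as
 * adjustable down to a single page.
 */
static uint64_t oom_protect_step_bytes(void)
{
	return SKETCH_PAGE_SIZE;
}
```

For a 512G machine, oom_score_adj_step_bytes(512ULL << 30) comes out at about
524M per unit, in line with the rough 512M figure above, while oom_protect
moves in 4K steps.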

>Your cover letter mentions that then "all processes in the cgroup as a
>whole". That to me reads as oom.group oom killer policy. But a brief
>look into the patch suggests you are still looking at specific tasks and
>this has been a concern in the previous version of the patch because
>memcg accounting and per-process accounting are detached.

I think memcg accounting is the more reasonable basis, because its memory
statistics are more comprehensive. For example, they include active page cache,
which also increases the probability of an OOM kill. In the new patch, shared
memory also consumes the memcg's oom_protect quota, so the oom_protect quota
left for each process in that memcg decreases.

>> 3. I conducted a test and deployed an excessive number of containers on a physical
>> machine. By setting the oom_score_adj value of all processes in the container to
>> a positive number through dockerinit, even processes that occupy very little memory 
>> in the container are easily killed, resulting in a large number of invalid kill behaviors. 
>> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the 
>> container will rebuild, resulting in more severe memory oscillations. The new patch 
>> abandons the behavior of adding an equal amount of oom_score_adj to each process 
>> in the container and adopts a shared oom_protect quota for all processes in the container. 
>> If a process in the container is killed, the remaining other processes will receive more 
>> oom_protect quota, making it more difficult for the remaining processes to be killed.
>> In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
>> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only 
>> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits 
>> downwards, and a user can only change the kill order of their own memcg OOM, while the
>> kill order of the parent memcg-oom-killer or global-oom-killer is not affected.
>
>Yes oom_score_adj has shortcomings.
>
>> In the final discussion of patch v2, we discussed that although the adjustment range
>> of oom_score_adj is [-1000,1000], it essentially only supports two usecases
>> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
>> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
>> indicator oom_kill_inherit, which counts the number of times the local and child 
>> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
>> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
>> value of oom_protect to achieve the best.
>
>What does the best mean in this context?

I have added a new indicator, oom_kill_inherit, which is negatively correlated
with memory.oom.protect. It gives us a yardstick for finding the optimal value
of memory.oom.protect.

>> About the semantics of non-leaf memcg protection:
>> If a non-leaf memcg's oom_protect quota is set, its leaf memcgs proportionally
>> calculate their new effective oom_protect quota based on the non-leaf memcg's quota.
>
>So the non-leaf memcg is never used as a target? What if the workload is
>distributed over several sub-groups? Our current oom.group
>implementation traverses the tree to find a common ancestor in the oom
>domain with the oom.group.

If the oom_protect quota of the parent non-leaf memcg is less than the sum of
its sub-groups' oom_protect quotas, the quota of each sub-group is
proportionally reduced.
If the oom_protect quota of the parent non-leaf memcg is greater than the sum
of its sub-groups' oom_protect quotas, the quota of each sub-group is
proportionally increased.
The purpose is to let users set oom_protect quotas according to their own
needs, while the system management process sets an appropriate oom_protect
quota on the parent non-leaf memcg as the final backstop. In this way the
system management process can indirectly manage all user processes.
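
The proportional scaling described above can be sketched roughly as follows.
This is only an illustration of the rule, not the code in the patch:
scale_child_quotas() and its parameters are hypothetical names, quotas are
assumed to be counted in pages, and the product of a child quota and the
parent quota is assumed to fit in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): scale every sub-group's
 * configured oom_protect quota so that the effective quotas of all
 * sub-groups add up to the parent's quota. Quotas are in pages.
 */
static void scale_child_quotas(uint64_t parent_quota,
			       const uint64_t *child_quota,
			       uint64_t *effective, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += child_quota[i];

	for (i = 0; i < n; i++) {
		if (sum == 0)
			effective[i] = 0;	/* no sub-group asked for protection */
		else
			effective[i] = child_quota[i] * parent_quota / sum;
	}
}
```

With sub-group quotas of 30 and 90 pages, a parent quota of 100 pages reduces
them to 25 and 75, and a parent quota of 240 pages raises them to 60 and 180.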

>All that being said and with the usecase described more specifically. I
>can see that memcg based oom victim selection makes some sense. That
>means that it is always a memcg selected and all tasks within killed.
>Memcg based protection can be used to evaluate which memcg to choose and
>the overall scheme should be still manageable. It would indeed resemble
>memory protection for the regular reclaim.
>
>One thing that is still not really clear to me is to how group vs.
>non-group ooms could be handled gracefully. Right now we can handle that
>because the oom selection is still process based but with the protection
>this will become more problematic as explained previously. Essentially
>we would need to enforce the oom selection to be memcg based for all
>memcgs. Maybe a mount knob? What do you think?

The patch contains a function that determines whether the oom_protect
mechanism is enabled. Every memory.oom.protect node defaults to 0, so
is_root_oom_protect() returns 0 by default. The oom_protect mechanism only
takes effect when "root_mem_cgroup->memory.children_oom_protect_usage"
is not 0, and only for memcgs whose memory.oom.protect node has been set.

+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
Is there anything wrong with my understanding?

-- 
Thanks for your comment!
chengkaitao
