From: "程垲涛 Chengkaitao Cheng" <chengkaitao@didiglobal.com>
To: Michal Hocko <mhocko@suse.com>
Cc: "tj@kernel.org" <tj@kernel.org>,
"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
"corbet@lwn.net" <corbet@lwn.net>,
"roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
"shakeelb@google.com" <shakeelb@google.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"brauner@kernel.org" <brauner@kernel.org>,
"muchun.song@linux.dev" <muchun.song@linux.dev>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"zhengqi.arch@bytedance.com" <zhengqi.arch@bytedance.com>,
"ebiederm@xmission.com" <ebiederm@xmission.com>,
"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
"pilgrimtao@gmail.com" <pilgrimtao@gmail.com>,
"haolee.swjtu@gmail.com" <haolee.swjtu@gmail.com>,
"yuzhao@google.com" <yuzhao@google.com>,
"willy@infradead.org" <willy@infradead.org>,
"vasily.averin@linux.dev" <vasily.averin@linux.dev>,
"vbabka@suse.cz" <vbabka@suse.cz>,
"surenb@google.com" <surenb@google.com>,
"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
"mcgrof@kernel.org" <mcgrof@kernel.org>,
"feng.tang@intel.com" <feng.tang@intel.com>,
"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection
Date: Tue, 9 May 2023 06:50:59 +0000
Message-ID: <900EF82B-9899-46DD-9ACC-16D82D9B7A3F@didiglobal.com>
In-Reply-To: <ZFkEqhAs7FELUO3a@dhcp22.suse.cz>
At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote:
>On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote:
>> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> >
>> >That being said, make sure you describe your usecase more thoroughly.
>> >Please also make sure you describe the intended heuristic of the knob.
>> >It is not really clear from the description how this fits hierarchical
>> >behavior of cgroups. I would be especially interested in the semantics
>> >of non-leaf memcgs protection as they do not have any actual processes
>> >to protect.
>> >
>> >Also there have been concerns mentioned in v2 discussion and it would be
>> >really appreciated to summarize how you have dealt with them.
>> >
>> >Please also note that many people are going to be slow in responding
>> >this week because of LSFMM conference
>> >(https://events.linuxfoundation.org/lsfmm/)
>>
>> Here is a more detailed comparison of the old oom_score_adj mechanism and the
>> new oom_protect mechanism:
>> 1. oom_protect offers finer-grained control than oom_score_adj. On a physical
>> machine with 512G of memory, the smallest step of oom_score_adj is about 512M,
>> while the smallest step of oom_protect is one page (4K).
>> 2. It may be simple to create a lightweight parent process and uniformly set the
>> oom_score_adj of a few important processes, but it is not simple to maintain
>> multi-level settings for tens of thousands of processes on a physical machine
>> through such lightweight parent processes. We may need a huge table to record the
>> oom_score_adj values maintained by all lightweight parent processes, and a
>> user process constrained by its parent process cannot change its own
>> oom_score_adj, because it does not know the details of that table. The new
>> patch uses the cgroup mechanism instead. It does not need any parent process to
>> manage oom_score_adj. The settings of each memcg are independent of each other,
>> which makes it easier to plan the OOM order of all processes. Because of the
>> nature of memory resources, cloud service vendors currently do not oversell
>> memory. I would like to use the new patch to make overselling memory
>> possible.
>
>OK, this is more specific about the usecase. Thanks! So essentially what
>it boils down to is that you are handling many containers (memcgs from
>our POV) and they have different priorities. You want to overcommit the
>memory to the extent that global ooms are not an unexpected event. Once
>that happens the total memory consumption of a specific memcg is less
>important than its "priority". You define that priority by the excess of
>the memory usage above a user defined threshold. Correct?
Yes, that is correct.
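To make the summary above concrete, here is a minimal userspace sketch, not code from the patch. It assumes each memcg is ranked purely by how many pages its usage exceeds its threshold, and for comparison with item 1 of the quoted list it also computes the size of one oom_score_adj step, which mm/oom_kill.c scales by totalpages / 1000. The struct, its fields and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-memcg view; the field names are illustrative only. */
struct memcg_sample {
	const char *name;
	uint64_t usage_pages;	/* memory charged to the memcg */
	uint64_t protect_pages;	/* user-defined threshold (memory.oom.protect) */
};

/* Pages of usage above the threshold; 0 when usage is fully covered. */
static uint64_t oom_excess_pages(const struct memcg_sample *m)
{
	return m->usage_pages > m->protect_pages ?
	       m->usage_pages - m->protect_pages : 0;
}

/*
 * Index of the memcg with the largest excess, i.e. the one with the
 * lowest "priority" under the assumed ranking. Ties keep the first entry.
 */
static size_t pick_largest_excess(const struct memcg_sample *v, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (oom_excess_pages(&v[i]) > oom_excess_pages(&v[best]))
			best = i;
	return best;
}

/*
 * Size of one oom_score_adj step in pages. The threshold above is
 * expressed directly in pages, so its step is a single page.
 */
static uint64_t oom_score_adj_step_pages(uint64_t totalpages)
{
	return totalpages / 1000;
}
```

With 512 GiB of RAM, 4 KiB pages and no swap, totalpages is 134217728, so one oom_score_adj step covers 134217 pages, which is roughly half a gigabyte.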
>Your cover letter mentions that then "all processes in the cgroup as a
>whole". That to me reads as oom.group oom killer policy. But a brief
>look into the patch suggests you are still looking at specific tasks and
>this has been a concern in the previous version of the patch because
>memcg accounting and per-process accounting are detached.
I think memcg accounting may be more reasonable, because its memory
statistics are more comprehensive. For example, they include active page cache,
which also increases the probability of an OOM kill. In the new patch, all
shared memory also consumes the memcg's oom_protect quota, so the share of
that quota available to each process in the memcg decreases.
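A minimal sketch of that effect, under an assumption that is mine and not taken from the patch: each process receives a slice of the memcg's quota proportional to its own pages relative to the memcg's total charged usage. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed model: a task's slice of the memcg quota is proportional to
 * its share of the memcg's charged usage. Memory charged to the memcg
 * but owned by no single task (shared memory, page cache) raises
 * memcg_usage and therefore shrinks every task's slice.
 *
 * All values are in pages; the intermediate product fits in 64 bits
 * for realistic page counts.
 */
static uint64_t task_protect_share(uint64_t protect, uint64_t task_pages,
				   uint64_t memcg_usage)
{
	assert(task_pages <= memcg_usage);

	if (memcg_usage == 0)
		return 0;
	if (protect >= memcg_usage)
		return task_pages;	/* the quota covers the whole memcg */
	return protect * task_pages / memcg_usage;
}
```

For example, with a quota of 600 pages, two tasks of 300 pages each and 400 pages of shared memory, each task is covered for 180 pages. If one task is killed, the usage drops to 700 pages and the remaining task is covered for 257 pages, which matches the behavior described for the shared quota.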
>> 3. I ran a test in which I deployed an excessive number of containers on a physical
>> machine. When dockerinit sets the oom_score_adj of all processes in the container to
>> a positive number, even processes that occupy very little memory in the container
>> are easily killed, resulting in a large number of invalid kills.
>> If dockerinit itself is killed, container self-healing is triggered and the
>> container is rebuilt, resulting in more severe memory oscillations. The new patch
>> no longer adds an equal amount of oom_score_adj to each process
>> in the container; instead, all processes in the container share one oom_protect quota.
>> If a process in the container is killed, the remaining processes receive more
>> oom_protect quota, which makes them harder to kill.
>> In my test case, the new patch reduced the number of invalid kills by 70%.
>> 4. oom_score_adj is a global setting, so it cannot produce a kill order that only
>> affects a certain memcg-oom-killer. The oom_protect mechanism, however, inherits
>> downwards: a user can only change the kill order of its own memcg OOM, and the
>> kill order of the parent memcg-oom-killer or the global-oom-killer is not affected.
>
>Yes oom_score_adj has shortcomings.
>
>> In the final discussion of patch v2, we noted that although the adjustment range
>> of oom_score_adj is [-1000,1000], it essentially only allows two usecases
>> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
>> clumsy at best. In order to solve this problem in the new patch, I introduced a new
>> indicator oom_kill_inherit, which counts the number of times the local and child
>> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
>> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
>> value of oom_protect to achieve the best.
>
>What does the best mean in this context?
I have created a new indicator, oom_kill_inherit, that is negatively correlated
with memory.oom.protect, so we have a yardstick for finding the optimal value of
memory.oom.protect.
>> About the semantics of non-leaf memcg protection:
>> If a non-leaf memcg's oom_protect quota is set, each of its leaf memcgs
>> calculates a new effective oom_protect quota in proportion to the non-leaf memcg's quota.
>
>So the non-leaf memcg is never used as a target? What if the workload is
>distributed over several sub-groups? Our current oom.group
>implementation traverses the tree to find a common ancestor in the oom
>domain with the oom.group.
If the oom_protect quota of the parent non-leaf memcg is less than the sum of
the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group is
proportionally reduced.
If the oom_protect quota of the parent non-leaf memcg is greater than the sum
of the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group
is proportionally increased.
The purpose is that users can set oom_protect quotas according to
their own needs, while the system management process sets an appropriate
oom_protect quota on the parent non-leaf memcg as the final backstop, so that
it can indirectly manage all user processes.
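The proportional scaling can be sketched as follows. This is a simplified userspace model with hypothetical names, assuming the effective quota of a sub-group is simply the parent's quota split in the ratio of the values the sub-groups asked for; the actual calculation in the patch may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed model: the parent's quota is split among its sub-groups in
 * the ratio of their configured quotas. If the parent's quota is below
 * the sum, every sub-group is scaled down; if it is above, every
 * sub-group is scaled up.
 *
 * All values are in pages; the intermediate product fits in 64 bits
 * for realistic page counts.
 */
static uint64_t effective_child_protect(uint64_t parent_protect,
					uint64_t child_protect,
					uint64_t siblings_sum)
{
	assert(child_protect <= siblings_sum);

	if (siblings_sum == 0)
		return 0;	/* no sub-group asked for protection */
	return parent_protect * child_protect / siblings_sum;
}
```

With sub-groups configured at 150 and 50 pages, a parent quota of 100 pages yields effective quotas of 75 and 25 pages, and a parent quota of 400 pages yields 300 and 100 pages.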
>All that being said and with the usecase described more specifically. I
>can see that memcg based oom victim selection makes some sense. That
>means that it is always a memcg selected and all tasks within killed.
>Memcg based protection can be used to evaluate which memcg to choose and
>the overall scheme should be still manageable. It would indeed resemble
>memory protection for the regular reclaim.
>
>One thing that is still not really clear to me is to how group vs.
>non-group ooms could be handled gracefully. Right now we can handle that
>because the oom selection is still process based but with the protection
>this will become more problematic as explained previously. Essentially
>we would need to enforce the oom selection to be memcg based for all
>memcgs. Maybe a mount knob? What do you think?
The patch has a function that determines whether the oom_protect
mechanism is enabled. All memory.oom.protect nodes default to 0, so
is_root_oom_protect() returns false by default. The oom_protect mechanism
only takes effect when "root_mem_cgroup->memory.children_oom_protect_usage"
is not 0, and only for memcgs that have memory.oom.protect set.
+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
Please let me know if I have misunderstood anything.
--
Thanks for your comment!
chengkaitao