Date: Mon, 22 May 2023 15:03:50 +0200
From: Michal Hocko
To: 程垲涛 Chengkaitao Cheng
Cc: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org,
 corbet@lwn.net, roman.gushchin@linux.dev, shakeelb@google.com,
 akpm@linux-foundation.org, brauner@kernel.org, muchun.song@linux.dev,
 viro@zeniv.linux.org.uk, zhengqi.arch@bytedance.com,
 ebiederm@xmission.com, Liam.Howlett@oracle.com, chengzhihao1@huawei.com,
 pilgrimtao@gmail.com, haolee.swjtu@gmail.com, yuzhao@google.com,
 willy@infradead.org, vasily.averin@linux.dev, vbabka@suse.cz,
 surenb@google.com, sfr@canb.auug.org.au, mcgrof@kernel.org,
 feng.tang@intel.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection
In-Reply-To: <900EF82B-9899-46DD-9ACC-16D82D9B7A3F@didiglobal.com>

[Sorry for a late reply but I was mostly offline last 2 weeks]

On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" wrote:
[...]
> >Your cover letter mentions that then "all processes in the cgroup as a
> >whole". That to me reads as oom.group oom killer policy.
> >But a brief look into the patch suggests you are still looking at
> >specific tasks and this has been a concern in the previous version of
> >the patch because memcg accounting and per-process accounting are
> >detached.
>
> I think the memcg accounting may be more reasonable, as its memory
> statistics are more comprehensive, similar to active page cache, which
> also increases the probability of OOM-kill. In the new patch, all the
> shared memory will also consume the oom_protect quota of the memcg,
> and the process's oom_protect quota of the memcg will decrease.

I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?

[...]

> >> In the final discussion of patch v2, we discussed that although the adjustment range
> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new
> >> indicator oom_kill_inherit, which counts the number of times the local and child
> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> >> value of oom_protect to achieve the best.
> >
> >What does the best mean in this context?
>
> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> with memory.oom.protect, so we have a ruler to measure the optimal value of
> memory.oom.protect.

An example might help here.

> >> about the semantics of non-leaf memcgs protection,
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups?
> >Our current oom.group implementation traverses the tree to find a
> >common ancestor in the oom domain with the oom.group.
>
> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
> sub-groups oom_protect quota, the oom_protect quota of each sub-group will
> be proportionally reduced
> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
> of sub-groups oom_protect quota, the oom_protect quota of each sub-group
> will be proportionally increased
> The purpose of doing so is that users can set oom_protect quota according to
> their own needs, and the system management process can set appropriate
> oom_protect quota on the parent non-leaf memcg as the final cover, so that
> the system management process can indirectly manage all user processes.

I guess that you are trying to say that the oom protection has a
standard hierarchical behavior. And that is fine, well, in fact it is
mandatory for any control knob to have sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly but it might be a suitable target to kill a multi
memcg based workload (e.g. a full container).

> >All that being said and with the usecase described more specifically. I
> >can see that memcg based oom victim selection makes some sense. That
> >means that it is always a memcg selected and all tasks within killed.
> >Memcg based protection can be used to evaluate which memcg to choose and
> >the overall scheme should be still manageable. It would indeed resemble
> >memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is to how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle that
> >because the oom selection is still process based but with the protection
> >this will become more problematic as explained previously.
> >Essentially we would need to enforce the oom selection to be memcg
> >based for all memcgs. Maybe a mount knob? What do you think?
>
> There is a function in the patch to determine whether the oom_protect
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
> returns 0 by default.

How can an admin determine what the current oom detection logic is?
-- 
Michal Hocko
SUSE Labs