From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 13 Jun 2023 10:16:53 +0200
From: Michal Hocko
To: 程垲涛 Chengkaitao Cheng
Cc: "tj@kernel.org", "lizefan.x@bytedance.com", "hannes@cmpxchg.org",
 "corbet@lwn.net", "roman.gushchin@linux.dev", "shakeelb@google.com",
 "akpm@linux-foundation.org", "brauner@kernel.org", "muchun.song@linux.dev",
 "viro@zeniv.linux.org.uk", "zhengqi.arch@bytedance.com",
 "ebiederm@xmission.com", "Liam.Howlett@oracle.com",
 "chengzhihao1@huawei.com", "pilgrimtao@gmail.com", "haolee.swjtu@gmail.com",
 "yuzhao@google.com", "willy@infradead.org", "vasily.averin@linux.dev",
 "vbabka@suse.cz", "surenb@google.com", "sfr@canb.auug.org.au",
 "mcgrof@kernel.org", "feng.tang@intel.com", "cgroups@vger.kernel.org",
 "linux-doc@vger.kernel.org", "linux-kernel@vger.kernel.org",
 "linux-fsdevel@vger.kernel.org", "linux-mm@kvack.org"
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
On Sun 04-06-23 08:05:53, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-29 22:02:47, "Michal Hocko" wrote:
> >On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
> >> At 2023-05-22 21:03:50, "Michal Hocko" wrote:
> >[...]
> >> >> I have created a new indicator oom_kill_inherit that maintains a negative
> >> >> correlation with memory.oom.protect, so we have a ruler to measure the
> >> >> optimal value of memory.oom.protect.
> >> >
> >> >An example might help here.
> >>
> >> In my testing case, by adjusting memory.oom.protect, I was able to
> >> significantly reduce the oom_kill_inherit of the corresponding cgroup. In a
> >> physical machine with severely oversold memory, I divided all cgroups into
> >> three categories and controlled their probability of being selected by the
> >> oom-killer to 0%, 20%, and 80%, respectively.
> >
> >I might be just dense but I am lost. Can we focus on the barebone
> >semantics of group oom selection and killing first? No magic
> >auto-tuning at this stage, please.
> >
> >> >> >> about the semantics of non-leaf memcg protection:
> >> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcgs will
> >> >> >> proportionally calculate a new effective oom_protect quota based on the
> >> >> >> non-leaf memcg's quota.
> >> >> >
> >> >> >So the non-leaf memcg is never used as a target? What if the workload is
> >> >> >distributed over several sub-groups? Our current oom.group
> >> >> >implementation traverses the tree to find a common ancestor in the oom
> >> >> >domain with the oom.group.
> >> >>
> >> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum
> >> >> of its sub-groups' oom_protect quotas, the oom_protect quota of each
> >> >> sub-group will be proportionally reduced.
> >> >> If the oom_protect quota of the parent non-leaf memcg is greater than the
> >> >> sum of its sub-groups' oom_protect quotas, the oom_protect quota of each
> >> >> sub-group will be proportionally increased.
> >> >> The purpose of doing so is that users can set oom_protect quotas according
> >> >> to their own needs, and the system management process can set an
> >> >> appropriate oom_protect quota on the parent non-leaf memcg as the final
> >> >> cover, so that the system management process can indirectly manage all
> >> >> user processes.
> >> >
> >> >I guess that you are trying to say that the oom protection has a
> >> >hierarchical behavior. And that is fine; in fact it is
> >> >mandatory for any control knob to have sane hierarchical properties.
> >> >But that doesn't address my above question. Let me try again. When is a
> >> >non-leaf memcg potentially selected as the oom victim? It doesn't have
> >> >any tasks directly, but it might be a suitable target to kill a
> >> >multi-memcg based workload (e.g. a full container).
> >>
> >> If a non-leaf memcg has the higher memory usage and the smaller
> >> memory.oom.protect, it will have the higher probability of being
> >> selected by the killer. If a non-leaf memcg is selected as the oom
> >> victim, the OOM killer will continue to select the appropriate child
> >> memcg downwards until a leaf memcg is selected.
> >
> >A parent memcg has more or equal memory charged than its child(ren) by
> >definition. Let me try to ask differently. Say you have the following
> >hierarchy
> >
> >               root
> >              /    \
> >   container_A      container_B
> >(oom.prot=100M)  (oom.prot=200M)
> >  (usage=120M)     (usage=180M)
> >     / | \
> >    A  B  C
> >         / \
> >       C1   C2
> >
> >container_B is protected so it should be excluded. Correct? So we are at
> >container_A to choose from. There are multiple outcomes the system and
> >container admins might want to achieve:
> >1) system admin might want to shut down the whole container.
> >2) container admin might want to shut the whole container down.
> >3) cont. admin might want to shut down a whole sub-group (e.g. C, as it
> >   is a self-contained workload and killing a portion of it would put it
> >   into an inconsistent state).
> >4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
> >   leaf memcg).
> >5) admin might want to kill a process in the most excess memcg.
> >
> >Now we already have the oom.group thingy that can drive the group killing
> >policy, but it is not really clear how you want to incorporate that into
> >the protection.
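The proportional quota scaling and excess-based victim selection described in the thread above can be sketched as a small model. This is plain Python for illustration only, not kernel code: the class and function names (`Memcg`, `effective_protect`, the eligibility rule) are hypothetical renderings of the semantics as described, not the patch's actual interface.

```python
# Hypothetical model of the proposed memory.oom.protect semantics.
# All names here are illustrative; this is not the patch's kernel code.

class Memcg:
    def __init__(self, name, protect=0, usage=0, children=None):
        self.name = name
        self.protect = protect          # memory.oom.protect quota, bytes
        self.usage = usage              # memory charged directly, bytes
        self.children = children or []

    def total_usage(self):
        # A parent's charge includes its children's by definition.
        return self.usage + sum(c.total_usage() for c in self.children)

def effective_protect(memcg, eprotect):
    """Scale children's quotas so they sum to the parent's effective quota.

    If the children together ask for more (or less) protection than the
    parent's quota, each child's quota shrinks (or grows) proportionally,
    per the rule quoted in the thread.
    """
    result = {memcg.name: eprotect}
    asked = sum(c.protect for c in memcg.children)
    for c in memcg.children:
        share = eprotect * c.protect / asked if asked else 0
        result.update(effective_protect(c, share))
    return result

M = 1024 * 1024
root = Memcg("root", children=[
    Memcg("container_A", protect=100 * M, usage=120 * M),
    Memcg("container_B", protect=200 * M, usage=180 * M),
])

# Root has no quota of its own here, so take the children's sum as a cover.
eff = effective_protect(root, root.protect or
                        sum(c.protect for c in root.children))

# Selection sketch: a memcg is eligible only once its usage exceeds its
# effective protection; among eligible siblings, pick the largest excess.
excess = {c.name: c.total_usage() - eff[c.name] for c in root.children}
eligible = {n: e for n, e in excess.items() if e > 0}
victim = max(eligible, key=eligible.get)
print(victim)  # -> container_A
```

Under this model container_B (20M under its protection) is excluded and container_A (20M over) becomes the victim domain; the same excess rule would then recurse into A, B, and C. Which of the five admin outcomes above that recursion should implement is exactly the open question in the thread.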
> >
> >Again, I think that an oom.protection makes sense, but the semantics have
> >to be very carefully thought through because it is quite easy to create
> >corner cases and weird behavior. I also think that oom.group has to be
> >consistent with the protection.
>
> The barebone semantics of the function implemented by my patch are
> summarized as follows:
> A memcg only allows processes in the memcg to be selected by their
> ancestor's OOM killer when the memory usage exceeds "oom.protect".

I am sure you would need to break this expectation if there is no such
memcg with tasks available. Or do you panic the system in that case for
the global oom, and retry forever for the memcg oom?

> It should be noted that "oom.protect" and "oom.group" are completely
> different things, and kneading them together may make the explanation
> more confusing.

I am not suggesting tying those two together by any means. I am merely
saying that those two have to be mutually cooperative and still
represent reasonable semantics. Please have a look at the above example
usecases and try to explain how the memory protection, as you have
defined and implemented it, fits in here.
-- 
Michal Hocko
SUSE Labs