From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 25 May 2023 10:19:20 -0700
Subject: Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection
To: 程垲涛 Chengkaitao Cheng
Cc: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org,
	corbet@lwn.net, mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeelb@google.com, akpm@linux-foundation.org, brauner@kernel.org,
	muchun.song@linux.dev, viro@zeniv.linux.org.uk,
	zhengqi.arch@bytedance.com, ebiederm@xmission.com,
	Liam.Howlett@oracle.com, chengzhihao1@huawei.com, pilgrimtao@gmail.com,
	haolee.swjtu@gmail.com, yuzhao@google.com, willy@infradead.org,
	vasily.averin@linux.dev, vbabka@suse.cz, surenb@google.com,
	sfr@canb.auug.org.au, mcgrof@kernel.org, feng.tang@intel.com,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, David Rientjes

On Thu, May 25, 2023 at 1:19 AM 程垲涛 Chengkaitao Cheng wrote:
>
> At 2023-05-24 06:02:55, "Yosry Ahmed" wrote:
> >On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng wrote:
> >>
> >> At 2023-05-20 06:04:26, "Yosry Ahmed" wrote:
> >> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng wrote:
> >> >>
> >> >> At 2023-05-18 04:42:12, "Yosry Ahmed" wrote:
> >> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >>
> >> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" wrote:
> >> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >> >>
> >> >> >>
> >> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> >> important processes. Killing processes in a user-defined priority order
> >> >> >> will result in a large number of OOM events and still not being able to
> >> >> >> release enough memory. I have been searching for a balance between
> >> >> >> the two methods, so that their shortcomings are not too obvious.
> >> >> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> >> >> to make good use of it.
> >> >> >
> >> >> >For us, killing processes in a user-defined priority order works well.
> >> >> >
> >> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
> >> >> >observe how many times this memcg has been killed due to a limit in an
> >> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >> >of protections among memcgs?
> >> >> >
> >> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >> >increase memory.oom.protect for more, based on their importance.
> >> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >> >the memory is protected, right?
> >> >>
> >> >> If all the memory in a memcg is protected, its meaning is similar to that
> >> >> of the highest-priority memcg in your approach: it is killed last, or
> >> >> never killed at all.
> >> >
> >> >Makes sense. I believe it gets a bit trickier when you want to
> >> >describe relative ordering between memcgs using memory.oom.protect.
> >>
> >> Actually, my original intention was not to use memory.oom.protect to
> >> achieve relative ordering between memcgs; that was just a feature that
> >> happened to be achievable. My initial idea was to protect a certain
> >> proportion of the memory in a memcg from being killed, so that physical
> >> memory can be planned reasonably. Both the physical machine manager
> >> and the container manager can then add some unimportant loads beyond
> >> the oom.protect limit, greatly improving the oversold rate of memory.
> >> In the worst case, the physical machine can always provide each memcg
> >> with all the memory guaranteed by its memory.oom.protect setting.
> >>
> >> On the other hand, I also want to achieve relative ordering of internal
> >> processes within a memcg, not just a unified ordering of all memcgs on
> >> a physical machine.
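
As a rough illustration of the planning idea above, consider the small
Python sketch below. All numbers are invented, and the worst-case-guarantee
reading of memory.oom.protect is taken from the paragraph above, not from
the patches:

GiB = 1 << 30
physical_ram = 64 * GiB

# Per-memcg (memory limit, oom-protected amount); hypothetical values.
plan = {
    "db":    (40 * GiB, 32 * GiB),
    "web":   (32 * GiB, 16 * GiB),
    "batch": (24 * GiB,  8 * GiB),
}

total_limit   = sum(limit for limit, _ in plan.values())
total_protect = sum(protect for _, protect in plan.values())

# Limits are oversold (96 GiB of limits on a 64 GiB machine), yet the
# protected amounts still fit in RAM, so every group keeps its worst-case
# floor even if everything above it is OOM killed.
print(total_limit / physical_ram)      # 1.5
print(total_protect <= physical_ram)   # True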
> >
> >For us, having a strict priority ordering-based selection is
> >essential. We have different tiers of jobs of different importance,
> >and a job of higher priority should not be killed before a lower
> >priority task if possible, no matter how much memory either of them is
> >using. Protecting memcgs solely based on their usage can be useful in
> >some scenarios, but not in a system where you have different tiers of
> >jobs running with strict priority ordering.
>
> If you want to run with strict priority ordering, it can also be achieved,
> but it may be quite troublesome. The directory structure shown below
> can achieve the goal.
>
>                  root
>                 /    \
>         cgroup A      cgroup B
>     (protect=max)     (protect=0)
>                      /    \
>              cgroup C      cgroup D
>          (protect=max)     (protect=0)
>                           /    \
>                   cgroup E      cgroup F
>               (protect=max)     (protect=0)
>
> Oom kill order: F > E > C > A

This requires restructuring the cgroup hierarchy, which comes with a lot
of other factors; I don't think that's practically an option.
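
For reference, the layout quoted above could be assembled roughly as in the
Python sketch below. This is only an illustration: it assumes the
memory.oom.protect file proposed in this series, a cgroup v2 hierarchy
mounted at /sys/fs/cgroup, and a parent chain (C and D under B, E and F
under D) inferred from the stated kill order; none of it is taken from the
patches themselves.

import os

CGROOT = "/sys/fs/cgroup"   # assumed cgroup v2 mount point

# Create the leaves; intermediate cgroups (B, B/D) are created along the way.
for path in ("A", "B/C", "B/D/E", "B/D/F"):
    os.makedirs(os.path.join(CGROOT, path), exist_ok=True)

# cgroup v2 requires delegating the memory controller at each level before
# memory.* interface files appear in the children.
for parent in ("", "B", "B/D"):
    with open(os.path.join(CGROOT, parent, "cgroup.subtree_control"), "w") as f:
        f.write("+memory")

# protect=max / protect=0 exactly as labelled in the diagram above.
protection = {
    "A": "max", "B": "0",
    "B/C": "max", "B/D": "0",
    "B/D/E": "max", "B/D/F": "0",
}
for path, value in protection.items():
    with open(os.path.join(CGROOT, path, "memory.oom.protect"), "w") as f:
        f.write(value)

Whether this actually produces the quoted kill order of F > E > C > A
depends, of course, on the semantics of the final version of the series.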
>
> As mentioned earlier, "running with strict priority ordering" may involve
> some extreme cases that require the manager to make a choice.

We have been using strict priority ordering in our fleet for many years
now and we depend on it. Some jobs are simply more important than
others, regardless of their usage.

>
> >>
> >> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >> >relative priority among the memcgs?
> >> >> >
> >> >> >>
> >> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >> >can both use :)
> >> >> >>
> >> >> >> If you have a good idea, please share more details or show some code.
> >> >> >> I would greatly appreciate it.
> >> >> >
> >> >> >The code we have needs to be rebased onto a different version and
> >> >> >cleaned up before it can be shared, but essentially it is as
> >> >> >described.
> >> >> >
> >> >> >(a) All processes and memcgs start with a default score.
> >> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >> >score means higher priority (i.e. a lower score gets killed first).
> >> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >> >score to kill, then among this memcg, it looks for the process with
> >> >> >the lowest score. Ties are broken based on usage, so essentially if
> >> >> >all processes/memcgs have the default score, we fall back to the
> >> >> >current OOM behavior.
> >> >>
> >> >> If memory oversold is severe, all processes of the lowest-priority
> >> >> memcg may be killed before any process in another memcg is selected.
> >> >> If there are 1000 processes with almost zero memory usage in
> >> >> the lowest-priority memcg, 1000 invalid kill events may occur.
> >> >> To avoid this situation, even for the lowest-priority memcg,
> >> >> I will leave it a very small oom.protect quota.
> >> >
> >> >I checked internally, and this is indeed something that we see from
> >> >time to time. We try to avoid that with userspace OOM killing, but
> >> >it's not 100% effective.
> >> >
> >> >>
> >> >> If faced with two memcgs with the same total memory usage and
> >> >> priority, where memcg A has more processes but less memory usage per
> >> >> single process, and memcg B has fewer processes but more
> >> >> memory usage per single process, then when OOM occurs, the
> >> >> processes in memcg B may continue to be killed until all processes
> >> >> in memcg B are killed, which is unfair to memcg B because memcg A
> >> >> also occupies a large amount of memory.
> >> >
> >> >I believe in this case we will kill one process in memcg B, then the
> >> >usage of memcg A will become higher, so we will pick a process from
> >> >memcg A next.
> >>
> >> If there is only one process in memcg A and its memory usage is higher
> >> than that of any single process in memcg B, but the total memory usage
> >> of memcg A is lower than that of memcg B, then if the OOM killer still
> >> chooses the process in memcg A, it may be unfair to memcg A.
> >>
> >> >> Does your approach have these issues? Killing processes in a
> >> >> user-defined priority order is indeed easier and can work well in most
> >> >> cases, but I have been trying to solve the cases that it cannot cover.
> >> >
> >> >The first issue is relatable with our approach. Let me dig more info
> >> >from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao
>
>
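
A minimal user-space model of the score-based selection described in points
(a) through (c) above, written as a small Python sketch. Since the actual
code has not been shared, the data structures, the default score, and the
usage tie-break below are illustrative assumptions only:

from dataclasses import dataclass, field

DEFAULT_SCORE = 0    # (a) every memcg and process starts with a default score

@dataclass
class Process:
    pid: int
    usage: int                    # bytes charged to this process
    score: int = DEFAULT_SCORE    # (b) userspace may raise or lower this

@dataclass
class Memcg:
    name: str
    processes: list = field(default_factory=list)
    score: int = DEFAULT_SCORE    # (b) userspace may raise or lower this

    @property
    def usage(self):
        return sum(p.usage for p in self.processes)

def pick_victim(memcgs):
    # (c) lowest-score memcg first, then lowest-score process inside it;
    # ties go to the bigger consumer, so with all-default scores this
    # degenerates into the usual "kill the largest user" behavior.
    memcg = min(memcgs, key=lambda m: (m.score, -m.usage))
    return min(memcg.processes, key=lambda p: (p.score, -p.usage))

# Example: B has the lower score, so its biggest process is picked even
# though the single process in A uses more memory overall.
a = Memcg("A", [Process(pid=1, usage=8 * (1 << 30))], score=10)
b = Memcg("B", [Process(pid=2, usage=1 << 30), Process(pid=3, usage=2 << 30)], score=5)
print(pick_victim([a, b]).pid)   # -> 3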