From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 23 May 2023 15:02:55 -0700
Subject: Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection
To: 程垲涛 Chengkaitao Cheng
Cc: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org,
	corbet@lwn.net, mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeelb@google.com, akpm@linux-foundation.org, brauner@kernel.org,
	muchun.song@linux.dev, viro@zeniv.linux.org.uk,
	zhengqi.arch@bytedance.com, ebiederm@xmission.com,
	Liam.Howlett@oracle.com, chengzhihao1@huawei.com, pilgrimtao@gmail.com,
	haolee.swjtu@gmail.com, yuzhao@google.com, willy@infradead.org,
	vasily.averin@linux.dev, vbabka@suse.cz, surenb@google.com,
	sfr@canb.auug.org.au, mcgrof@kernel.org, feng.tang@intel.com,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, David Rientjes
Content-Type: text/plain; charset="UTF-8"

On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng wrote:
>
> At 2023-05-20 06:04:26, "Yosry Ahmed" wrote:
> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng wrote:
> >>
> >> At 2023-05-18 04:42:12, "Yosry Ahmed" wrote:
> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >>
> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" wrote:
> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >>
> >> >>
> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> important processes. Killing processes in a user-defined priority order
> >> >> will result in a large number of OOM events while still not releasing
> >> >> enough memory. I have been searching for a balance between the two
> >> >> methods, so that their shortcomings are not too obvious. The biggest
> >> >> advantage of memcg is its tree topology, and I also hope to make good
> >> >> use of it.
> >> >
> >> >For us, killing processes in a user-defined priority order works well.
> >> >
> >> >It seems like, to tune memory.oom.protect, you use oom_kill_inherit to
> >> >observe how many times this memcg has been killed due to a limit in an
> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >of protections among memcgs?
> >> >
> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >increase memory.oom.protect for more, based on their importance.
> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >the memory is protected, right?
> >>
> >> If all memory in a memcg is protected, its meaning is similar to that of
> >> the highest-priority memcg in your approach: it is either killed last or
> >> never killed.
> >
> >Makes sense. I believe it gets a bit trickier when you want to
> >describe relative ordering between memcgs using memory.oom.protect.
>
> Actually, my original intention was not to use memory.oom.protect to
> achieve relative ordering between memcgs; that was just a feature that
> happened to be achievable. My initial idea was to protect a certain
> proportion of the memory in a memcg from being killed, so that physical
> memory can be planned sensibly. Both the physical machine manager and
> the container manager can then add some unimportant loads beyond the
> oom.protect limit, greatly improving the memory overcommit ratio. In the
> worst case, the physical machine can always provide all the memory
> promised by memory.oom.protect to each memcg.
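For concreteness, one way to read that worst-case guarantee is that the
OOM killer only counts a memcg's usage in excess of its protection when
picking victims, so a fully protected memcg is skipped while unprotected
usage remains elsewhere. The small user-space sketch below only
illustrates that reading; the struct, field names, and the
usage-minus-protection arithmetic are assumptions made for the example,
not code from the patch.

/*
 * Illustrative sketch only: pick the memcg with the most "eligible"
 * usage, where eligible usage is whatever exceeds the memcg's
 * protected bytes. A fully protected memcg contributes zero and is
 * therefore never chosen while something unprotected remains.
 */
#include <stddef.h>
#include <stdio.h>

struct memcg_sample {
	const char *name;
	unsigned long usage;	/* bytes currently charged */
	unsigned long protect;	/* bytes shielded by protection */
};

static unsigned long eligible_usage(const struct memcg_sample *m)
{
	return m->usage > m->protect ? m->usage - m->protect : 0;
}

int main(void)
{
	struct memcg_sample cgs[] = {
		{ "important",   800UL << 20, 800UL << 20 },	/* fully protected */
		{ "batch",       600UL << 20, 100UL << 20 },
		{ "best-effort", 300UL << 20,   0UL        },
	};
	const struct memcg_sample *victim = &cgs[0];

	/* Pick the memcg with the largest unprotected usage. */
	for (size_t i = 1; i < sizeof(cgs) / sizeof(cgs[0]); i++)
		if (eligible_usage(&cgs[i]) > eligible_usage(victim))
			victim = &cgs[i];

	printf("OOM would target '%s' (eligible %lu MiB)\n",
	       victim->name, eligible_usage(victim) >> 20);
	return 0;
}

Under that reading, the unimportant loads added beyond the oom.protect
limits are exactly what the killer reclaims first, which is what makes
the overcommit described above safe in the worst case.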
>
> On the other hand, I also want to achieve relative ordering of internal
> processes within a memcg, not just a unified ordering of all memcgs on
> physical machines.

For us, having a strict priority-ordering-based selection is essential.
We have tiers of jobs of different importance, and a job of higher
priority should not be killed before a lower-priority task if possible,
no matter how much memory either of them is using.

Protecting memcgs solely based on their usage can be useful in some
scenarios, but not in a system where you have different tiers of jobs
running with strict priority ordering.

>
> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >relative priority among the memcgs?
> >> >
> >> >>
> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >can both use :)
> >> >>
> >> >> If you have a good idea, please share more details or show some code.
> >> >> I would greatly appreciate it.
> >> >
> >> >The code we have needs to be rebased onto a different version and
> >> >cleaned up before it can be shared, but essentially it is as
> >> >described:
> >> >
> >> >(a) All processes and memcgs start with a default score.
> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >score means higher priority (i.e. a lower score gets killed first).
> >> >(c) The OOM killer looks for the memcg with the lowest score to kill,
> >> >then, within that memcg, for the process with the lowest score. Ties
> >> >are broken based on usage, so if all processes/memcgs have the default
> >> >score, we fall back to the current OOM behavior.
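To make (a)-(c) concrete, here is a minimal user-space sketch of the
two-level pick: the lowest-score memcg first, then the lowest-score
process inside it, with ties broken by usage. It only illustrates the
description above; the struct layout, score values, and tie-breaking
direction are assumptions, not the actual implementation, which has not
been shared yet.

/*
 * Illustrative sketch of the score-based selection described in
 * (a)-(c): pick the lowest-score memcg, then the lowest-score
 * process inside it, breaking ties by memory usage.
 */
#include <stddef.h>
#include <stdio.h>

struct proc_sample {
	const char *comm;
	int score;		/* lower score is killed first */
	unsigned long usage;	/* tie-breaker: bigger usage is killed first */
};

struct memcg_sample {
	const char *name;
	int score;
	unsigned long usage;
	struct proc_sample *procs;
	size_t nr_procs;
};

/* Return nonzero if candidate a should be killed before candidate b. */
static int worse(int score_a, unsigned long usage_a,
		 int score_b, unsigned long usage_b)
{
	if (score_a != score_b)
		return score_a < score_b;	/* lowest score loses */
	return usage_a > usage_b;		/* tie: biggest usage loses */
}

int main(void)
{
	struct proc_sample batch_procs[] = {
		{ "batch-worker", 10, 400UL << 20 },
		{ "batch-logger", 10,  50UL << 20 },
	};
	struct proc_sample web_procs[] = {
		{ "web-server",   50, 900UL << 20 },
	};
	struct memcg_sample cgs[] = {
		{ "batch", 10, 450UL << 20, batch_procs, 2 },
		{ "web",   50, 900UL << 20, web_procs,   1 },
	};
	struct memcg_sample *vcg = NULL;
	struct proc_sample *vproc = NULL;

	/* (c) pick the memcg with the lowest score (ties by usage)... */
	for (size_t i = 0; i < sizeof(cgs) / sizeof(cgs[0]); i++)
		if (!vcg || worse(cgs[i].score, cgs[i].usage, vcg->score, vcg->usage))
			vcg = &cgs[i];

	/* ...then the lowest-score process inside it (ties by usage). */
	for (size_t i = 0; i < vcg->nr_procs; i++)
		if (!vproc || worse(vcg->procs[i].score, vcg->procs[i].usage,
				    vproc->score, vproc->usage))
			vproc = &vcg->procs[i];

	printf("victim: %s/%s\n", vcg->name, vproc->comm);
	return 0;
}

With everything left at the default score, the comparison reduces to
pure usage, which matches the fallback to the current OOM behavior
mentioned in (c).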
> >>
> >> If memory overcommit is severe, all processes of the lowest-priority
> >> memcg may be killed before any process in another memcg is selected.
> >> If there are 1000 processes with almost zero memory usage in the
> >> lowest-priority memcg, 1000 ineffective kill events may occur. To
> >> avoid this situation, even for the lowest-priority memcg, I will
> >> leave it a very small oom.protect quota.
> >
> >I checked internally, and this is indeed something that we see from
> >time to time. We try to avoid it with userspace OOM killing, but that
> >is not 100% effective.
> >
> >>
> >> If faced with two memcgs with the same total memory usage and
> >> priority, where memcg A has more processes but less memory usage per
> >> process, and memcg B has fewer processes but more memory usage per
> >> process, then when OOM occurs, processes in memcg B may keep being
> >> killed until none are left, which is unfair to memcg B because memcg A
> >> also occupies a large amount of memory.
> >
> >I believe in this case we will kill one process in memcg B, then the
> >usage of memcg A will become higher, so we will pick a process from
> >memcg A next.
>
> Suppose there is only one process in memcg A, its memory usage is
> higher than that of any single process in memcg B, but the total memory
> usage of memcg A is lower than that of memcg B. In this case, if the
> OOM killer still chooses the process in memcg A, it may be unfair to
> memcg A.
>
> >> Does your approach have these issues? Killing processes in a
> >> user-defined priority order is indeed easier and works well in most
> >> cases, but I have been trying to solve the cases that it cannot cover.
> >
> >The first issue also applies to our approach. Let me dig for more
> >information from our internal teams and get back to you with more
> >details.
>
> --
> Thanks for your comment!
> chengkaitao