From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 25 May 2023 10:19:20 -0700
Subject: Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection
To: 程垲涛 Chengkaitao Cheng
Cc: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org,
	corbet@lwn.net, mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeelb@google.com, akpm@linux-foundation.org, brauner@kernel.org,
	muchun.song@linux.dev, viro@zeniv.linux.org.uk,
	zhengqi.arch@bytedance.com, ebiederm@xmission.com,
	Liam.Howlett@oracle.com, chengzhihao1@huawei.com, pilgrimtao@gmail.com,
	haolee.swjtu@gmail.com, yuzhao@google.com, willy@infradead.org,
	vasily.averin@linux.dev, vbabka@suse.cz, surenb@google.com,
	sfr@canb.auug.org.au, mcgrof@kernel.org, feng.tang@intel.com,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, David Rientjes

On Thu, May 25, 2023 at 1:19 AM 程垲涛 Chengkaitao Cheng wrote:
>
> At 2023-05-24 06:02:55, "Yosry Ahmed" wrote:
> >On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng wrote:
> >>
> >> At 2023-05-20 06:04:26, "Yosry Ahmed" wrote:
> >> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng wrote:
> >> >>
> >> >> At 2023-05-18 04:42:12, "Yosry Ahmed" wrote:
> >> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >>
> >> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" wrote:
> >> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >> >>
> >> >> >>
> >> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> >> important processes. Killing processes in a user-defined priority order
> >> >> >> will result in a large number of OOM events and still not being able to
> >> >> >> release enough memory. I have been searching for a balance between
> >> >> >> the two methods, so that their shortcomings are not too obvious.
> >> >> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> >> >> to make good use of it.
> >> >> >
> >> >> >For us, killing processes in a user-defined priority order works well.
> >> >> >
> >> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
> >> >> >observe how many times this memcg has been killed due to a limit in an
> >> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >> >of protections among memcgs?
> >> >> >
> >> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >> >increase memory.oom.protect for more, based on their importance.
> >> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >> >the memory is protected, right?
> >> >>
> >> >> If all the memory in a memcg is protected, its meaning is similar to that
> >> >> of the highest-priority memcg in your approach: it is killed last, or
> >> >> never killed at all.
> >> >
> >> >Makes sense. I believe it gets a bit trickier when you want to
> >> >describe relative ordering between memcgs using memory.oom.protect.
> >>
> >> Actually, my original intention was not to use memory.oom.protect to
> >> achieve relative ordering between memcgs; that was just a feature that
> >> happened to be achievable. My initial idea was to protect a certain
> >> proportion of the memory in a memcg from being killed, so that physical
> >> memory can be planned reasonably. Both the physical machine manager
> >> and the container manager can then add some unimportant loads beyond
> >> the oom.protect limit, greatly improving the oversold rate of memory.
> >> In the worst case, the physical machine can always provide each memcg
> >> with all the memory guaranteed by its memory.oom.protect setting.
> >>
> >> On the other hand, I also want to achieve relative ordering of internal
> >> processes within a memcg, not just a unified ordering of all memcgs on
> >> a physical machine.
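
As a rough illustration of the planning idea above, consider the small
Python sketch below. All numbers are invented, and the worst-case-guarantee
reading of memory.oom.protect is taken from the paragraph above, not from
the patches:

GiB = 1 << 30
physical_ram = 64 * GiB

# Per-memcg (memory limit, oom-protected amount); hypothetical values.
plan = {
    "db":    (40 * GiB, 32 * GiB),
    "web":   (32 * GiB, 16 * GiB),
    "batch": (24 * GiB,  8 * GiB),
}

total_limit   = sum(limit for limit, _ in plan.values())
total_protect = sum(protect for _, protect in plan.values())

# Limits are oversold (96 GiB of limits on a 64 GiB machine), yet the
# protected amounts still fit in RAM, so every group keeps its worst-case
# floor even if everything above it is OOM killed.
print(total_limit / physical_ram)      # 1.5
print(total_protect <= physical_ram)   # True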
> >
> >For us, having a strict priority ordering-based selection is
> >essential. We have different tiers of jobs of different importance,
> >and a job of higher priority should not be killed before a lower
> >priority task if possible, no matter how much memory either of them is
> >using. Protecting memcgs solely based on their usage can be useful in
> >some scenarios, but not in a system where you have different tiers of
> >jobs running with strict priority ordering.
>
> If you want to run with strict priority ordering, it can also be achieved,
> but it may be quite troublesome. The directory structure shown below
> can achieve the goal.
>
>                  root
>                 /    \
>         cgroup A      cgroup B
>     (protect=max)     (protect=0)
>                      /    \
>              cgroup C      cgroup D
>          (protect=max)     (protect=0)
>                           /    \
>                   cgroup E      cgroup F
>               (protect=max)     (protect=0)
>
> Oom kill order: F > E > C > A

This requires restructuring the cgroup hierarchy, which comes with a lot
of other factors; I don't think that's practically an option.
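
For reference, the layout quoted above could be assembled roughly as in the
Python sketch below. This is only an illustration: it assumes the
memory.oom.protect file proposed in this series, a cgroup v2 hierarchy
mounted at /sys/fs/cgroup, and a parent chain (C and D under B, E and F
under D) inferred from the stated kill order; none of it is taken from the
patches themselves.

import os

CGROOT = "/sys/fs/cgroup"   # assumed cgroup v2 mount point

# Create the leaves; intermediate cgroups (B, B/D) are created along the way.
for path in ("A", "B/C", "B/D/E", "B/D/F"):
    os.makedirs(os.path.join(CGROOT, path), exist_ok=True)

# cgroup v2 requires delegating the memory controller at each level before
# memory.* interface files appear in the children.
for parent in ("", "B", "B/D"):
    with open(os.path.join(CGROOT, parent, "cgroup.subtree_control"), "w") as f:
        f.write("+memory")

# protect=max / protect=0 exactly as labelled in the diagram above.
protection = {
    "A": "max", "B": "0",
    "B/C": "max", "B/D": "0",
    "B/D/E": "max", "B/D/F": "0",
}
for path, value in protection.items():
    with open(os.path.join(CGROOT, path, "memory.oom.protect"), "w") as f:
        f.write(value)

Whether this actually produces the quoted kill order of F > E > C > A
depends, of course, on the semantics of the final version of the series.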
>
> As mentioned earlier, "running with strict priority ordering" may involve
> some extreme cases that require the manager to make a choice.

We have been using strict priority ordering in our fleet for many years
now and we depend on it. Some jobs are simply more important than
others, regardless of their usage.

>
> >>
> >> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >> >relative priority among the memcgs?
> >> >> >
> >> >> >>
> >> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >> >can both use :)
> >> >> >>
> >> >> >> If you have a good idea, please share more details or show some code.
> >> >> >> I would greatly appreciate it.
> >> >> >
> >> >> >The code we have needs to be rebased onto a different version and
> >> >> >cleaned up before it can be shared, but essentially it is as
> >> >> >described.
> >> >> >
> >> >> >(a) All processes and memcgs start with a default score.
> >> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >> >score means higher priority (i.e. a lower score gets killed first).
> >> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >> >score to kill, then among this memcg, it looks for the process with
> >> >> >the lowest score. Ties are broken based on usage, so essentially if
> >> >> >all processes/memcgs have the default score, we fall back to the
> >> >> >current OOM behavior.
> >> >>
> >> >> If memory oversold is severe, all processes of the lowest-priority
> >> >> memcg may be killed before any process in another memcg is selected.
> >> >> If there are 1000 processes with almost zero memory usage in
> >> >> the lowest-priority memcg, 1000 invalid kill events may occur.
> >> >> To avoid this situation, even for the lowest-priority memcg,
> >> >> I will leave it a very small oom.protect quota.
> >> >
> >> >I checked internally, and this is indeed something that we see from
> >> >time to time. We try to avoid that with userspace OOM killing, but
> >> >it's not 100% effective.
> >> >
> >> >>
> >> >> If faced with two memcgs with the same total memory usage and
> >> >> priority, where memcg A has more processes but less memory usage per
> >> >> single process, and memcg B has fewer processes but more
> >> >> memory usage per single process, then when OOM occurs, the
> >> >> processes in memcg B may continue to be killed until all processes
> >> >> in memcg B are killed, which is unfair to memcg B because memcg A
> >> >> also occupies a large amount of memory.
> >> >
> >> >I believe in this case we will kill one process in memcg B, then the
> >> >usage of memcg A will become higher, so we will pick a process from
> >> >memcg A next.
> >>
> >> If there is only one process in memcg A and its memory usage is higher
> >> than that of any single process in memcg B, but the total memory usage
> >> of memcg A is lower than that of memcg B, then if the OOM killer still
> >> chooses the process in memcg A, it may be unfair to memcg A.
> >>
> >> >> Does your approach have these issues? Killing processes in a
> >> >> user-defined priority order is indeed easier and can work well in most
> >> >> cases, but I have been trying to solve the cases that it cannot cover.
> >> >
> >> >The first issue is relatable with our approach. Let me dig more info
> >> >from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao
>
>
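
A minimal user-space model of the score-based selection described in points
(a) through (c) above, written as a small Python sketch. Since the actual
code has not been shared, the data structures, the default score, and the
usage tie-break below are illustrative assumptions only:

from dataclasses import dataclass, field

DEFAULT_SCORE = 0    # (a) every memcg and process starts with a default score

@dataclass
class Process:
    pid: int
    usage: int                    # bytes charged to this process
    score: int = DEFAULT_SCORE    # (b) userspace may raise or lower this

@dataclass
class Memcg:
    name: str
    processes: list = field(default_factory=list)
    score: int = DEFAULT_SCORE    # (b) userspace may raise or lower this

    @property
    def usage(self):
        return sum(p.usage for p in self.processes)

def pick_victim(memcgs):
    # (c) lowest-score memcg first, then lowest-score process inside it;
    # ties go to the bigger consumer, so with all-default scores this
    # degenerates into the usual "kill the largest user" behavior.
    memcg = min(memcgs, key=lambda m: (m.score, -m.usage))
    return min(memcg.processes, key=lambda p: (p.score, -p.usage))

# Example: B has the lower score, so its biggest process is picked even
# though the single process in A uses more memory overall.
a = Memcg("A", [Process(pid=1, usage=8 * (1 << 30))], score=10)
b = Memcg("B", [Process(pid=2, usage=1 << 30), Process(pid=3, usage=2 << 30)], score=5)
print(pick_victim([a, b]).pid)   # -> 3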