From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 16 May 2023 23:59:06 -0700
Subject: Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection
To: chengkaitao
Cc: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org,
 corbet@lwn.net, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeelb@google.com, akpm@linux-foundation.org, brauner@kernel.org,
 muchun.song@linux.dev, viro@zeniv.linux.org.uk, zhengqi.arch@bytedance.com,
 ebiederm@xmission.com, Liam.Howlett@oracle.com, chengzhihao1@huawei.com,
 pilgrimtao@gmail.com, haolee.swjtu@gmail.com, yuzhao@google.com,
 willy@infradead.org, vasily.averin@linux.dev, vbabka@suse.cz,
 surenb@google.com, sfr@canb.auug.org.au, mcgrof@kernel.org,
 sujiaxun@uniontech.com, feng.tang@intel.com, cgroups@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, David Rientjes
In-Reply-To: <20230517032032.76334-1-chengkaitao@didiglobal.com>

+David Rientjes

On Tue, May 16, 2023 at 8:20 PM chengkaitao wrote:
>
> Establish a new OOM score algorithm that supports a cgroup-level OOM
> protection mechanism. When a global or memcg OOM event occurs, we treat
> all processes in the cgroup as a whole, and the OOM killer selects the
> process to kill based on the protection quota of the cgroup.
>
> Here is a more detailed comparison and introduction of the old
> oom_score_adj mechanism and the new oom_protect mechanism:
>
> 1. The regulating granularity of oom_protect is finer than that of
> oom_score_adj. On a 512G physical machine, the minimum granularity
> adjustable by oom_score_adj is 512M (1/1000 of total memory), while the
> minimum granularity adjustable by oom_protect is one page (4K).
> 2. It may be simple to create a lightweight parent process and uniformly
> set the oom_score_adj of some important processes, but it is not simple
> to make multi-level settings for tens of thousands of processes on a
> physical machine through lightweight parent processes. We would need a
> huge table to record the oom_score_adj values maintained by all the
> lightweight parent processes, and a user process limited by its parent
> has no ability to change its own oom_score_adj, because it does not know
> the details of that table. On the other hand, we have to set the common
> parent process's oom_score_adj before it forks all its children. We must
> strictly follow this setting sequence, and once oom_score_adj is set, it
> cannot be changed. In short, it is very difficult to apply oom_score_adj
> in other situations. The new patch adopts the cgroup mechanism: it does
> not need any parent process to manage oom_score_adj, and the settings of
> each memcg are independent of each other, making it easier to plan the
> OOM order of all processes. Due to the unique nature of memory
> resources, cloud service vendors currently do not oversell memory in
> their planning. I would like to use the new patch to explore the
> possibility of overselling memory resources.
> 3. I conducted a test and deployed an excessive number of containers on
> a physical machine. By setting the oom_score_adj value of all processes
> in a container to a positive number through dockerinit, even processes
> that occupy very little memory in the container are easily killed,
> resulting in a large number of invalid kill behaviors. If dockerinit is
> itself killed, it triggers container self-healing and the container is
> rebuilt, resulting in even more severe memory oscillations. The new
> patch abandons the behavior of adding an equal amount of oom_score_adj
> to each process in the container and adopts a shared oom_protect quota
> for all processes in the container.
> If a process in the container is killed, the remaining processes will
> receive more oom_protect quota, making it more difficult for them to be
> killed. In my test case, the new patch reduced the number of invalid
> kill behaviors by 70%.
> 4. oom_score_adj is a global configuration and cannot achieve a kill
> order that only affects a certain memcg OOM killer. The oom_protect
> mechanism, however, is inherited downwards: if the oom_protect quota of
> the parent cgroup is less than the sum of the sub-cgroups' oom_protect
> quotas, the oom_protect quota of each sub-cgroup is proportionally
> reduced; if the oom_protect quota of the parent cgroup is greater than
> that sum, the oom_protect quota of each sub-cgroup is proportionally
> increased (see the sketch after this list). The purpose is that users
> can set oom_protect quotas according to their own needs, and the system
> management process can set an appropriate oom_protect quota on the
> parent memcg as the final cover. If the oom_protect of the parent cgroup
> is 0, the kill order of memcg OOMs or global OOMs is not affected by
> user-specific settings.
> 5. Per-process accounting does not count shared memory, similar to
> active page cache, which also increases the probability of an OOM kill.
> The memcg accounting may be more reasonable, as its memory statistics
> are more comprehensive. In the new patch, all shared memory also
> consumes the oom_protect quota of the memcg, so each process's share of
> the memcg's oom_protect quota decreases and the probability of it being
> killed increases.
> 6. In the final discussion of patch v2, we noted that although the
> adjustment range of oom_score_adj is [-1000,1000], essentially only two
> use cases (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) work reliably;
> everything in between is clumsy at best. To solve this problem, the new
> patch introduces a new indicator, oom_kill_inherit, which counts the
> number of times the local and child cgroups have been selected by the
> OOM killer of an ancestor cgroup. oom_kill_inherit maintains a negative
> correlation with memory.oom.protect, so we have a ruler to measure the
> optimal value of memory.oom.protect. By observing the proportion of
> oom_kill_inherit in the parent cgroup, I can effectively adjust the
> value of oom_protect to achieve the best result.
>
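(Editorial illustration, not code from this patch set: a minimal C sketch
of the proportional scaling rule described in point 4 above. The function
name, parameters, and example numbers are assumptions made for clarity;
the actual implementation in the patch may differ.)

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the inheritance rule from point 4: a sub-cgroup's effective
 * oom_protect quota is its configured quota rescaled against the parent's
 * effective quota.  The same formula covers both cases: when the
 * children's quotas over-commit the parent (sum > parent) they are
 * proportionally reduced, and when they under-commit (sum < parent) they
 * are proportionally increased.
 */
static uint64_t effective_oom_protect(uint64_t child_quota,
                                      uint64_t parent_effective,
                                      uint64_t children_quota_sum)
{
        if (children_quota_sum == 0)
                return 0;
        /*
         * Note: the multiply can overflow 64 bits for very large quotas;
         * a real implementation would need a widening or checked multiply
         * (cf. the v4 changelog entry about an overflow warning).
         */
        return child_quota * parent_effective / children_quota_sum;
}

int main(void)
{
        /* Parent protects 100 MiB; children ask for 30 MiB + 90 MiB,
         * so both are scaled down to 25 MiB and 75 MiB respectively. */
        uint64_t parent = 100ULL << 20;
        uint64_t a = 30ULL << 20, b = 90ULL << 20;

        printf("child a: %llu MiB\n", (unsigned long long)
               (effective_oom_protect(a, parent, a + b) >> 20));
        printf("child b: %llu MiB\n", (unsigned long long)
               (effective_oom_protect(b, parent, a + b) >> 20));
        return 0;
}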
> Changelog:
> v4:
> * Fix warning: overflow in expression. (patch 1)
> * Supplementary commit information. (patch 0)
> v3:
> * Add "auto" option for memory.oom.protect. (patch 1)
> * Fix division errors. (patch 1)
> * Add observation indicator oom_kill_inherit. (patch 2)
> https://lore.kernel.org/linux-mm/20230506114948.6862-1-chengkaitao@didiglobal.com/
> v2:
> * Modify the formula of the process request memcg protection quota.
> https://lore.kernel.org/linux-mm/20221208034644.3077-1-chengkaitao@didiglobal.com/
> v1:
> https://lore.kernel.org/linux-mm/20221130070158.44221-1-chengkaitao@didiglobal.com/
>
> chengkaitao (2):
>   mm: memcontrol: protect the memory in cgroup from being oom killed
>   memcg: add oom_kill_inherit event indicator
>
>  Documentation/admin-guide/cgroup-v2.rst |  29 ++++-
>  fs/proc/base.c                          |  17 ++-
>  include/linux/memcontrol.h              |  46 +++++++-
>  include/linux/oom.h                     |   3 +-
>  include/linux/page_counter.h            |   6 +
>  mm/memcontrol.c                         | 199 ++++++++++++++++++++++
>  mm/oom_kill.c                           |  25 ++--
>  mm/page_counter.c                       |  30 +++++
>  8 files changed, 334 insertions(+), 21 deletions(-)
>
> --
> 2.14.1

Perhaps this is only slightly relevant, but at Google we do have a
different per-memcg approach to protect from OOM kills, or more
specifically to tell the kernel how we would like the OOM killer to
behave.

We define an interface called memory.oom_score_badness, and we also
allow it to be specified per-process through a procfs interface,
similar to oom_score_adj. These scores essentially tell the OOM killer
the order in which we prefer memcgs to be OOM'd, and the order in
which we want processes within a memcg to be OOM'd. By default, all
processes and memcgs start with the same score. Ties are broken based
on the rss of the process or the usage of the memcg (prefer to kill
the process/memcg that will free more memory) -- similar to the
current OOM killer.

This has been brought up before in other discussions without much
interest [1], but I thought it may be relevant here.

[1] https://lore.kernel.org/lkml/CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com/#t
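(Editorial illustration: the selection order described above -- higher
badness first, with usage/rss as the tie-breaker -- could be sketched
roughly as below. The struct and field names are hypothetical; this is
not the actual memory.oom_score_badness implementation.)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical victim descriptor; field names are illustrative only. */
struct oom_candidate {
        int64_t  badness;   /* per-memcg or per-process badness score */
        uint64_t usage;     /* bytes expected to be freed by killing it */
};

/* Returns nonzero if 'a' should be chosen as the OOM victim before 'b'. */
static int oom_prefer(const struct oom_candidate *a,
                      const struct oom_candidate *b)
{
        if (a->badness != b->badness)
                return a->badness > b->badness;   /* higher badness first */
        return a->usage > b->usage;               /* tie: free more memory */
}

int main(void)
{
        struct oom_candidate p = { .badness = 10, .usage = 1 << 20 };
        struct oom_candidate q = { .badness = 10, .usage = 4 << 20 };

        /* Same badness, so the larger consumer (q) is preferred. */
        printf("kill q first: %d\n", oom_prefer(&q, &p));
        return 0;
}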