Subject: Re: [PATCH] mm, oom: make the calculation of oom badness more accurate
From: Yafang Shao
To: Michal Hocko, David Rientjes
Cc: Andrew Morton, Linux MM
Date: Wed, 8 Jul 2020 23:11:43 +0800
In-Reply-To: <20200708142806.GJ7271@dhcp22.suse.cz>
References: <1594214649-9837-1-git-send-email-laoar.shao@gmail.com> <20200708142806.GJ7271@dhcp22.suse.cz>

On Wed, Jul 8, 2020 at 10:28 PM Michal Hocko wrote:
>
> On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > Recently we found an issue on our production environment: when memcg
> > oom is triggered, the oom killer doesn't choose the process with the
> > largest resident memory but chooses the first scanned process. Note
> > that all the processes in this memcg have the same oom_score_adj, so
> > the oom killer should choose the process with the largest resident
> > memory.
> >
> > Below is part of the oom info, which is enough to analyze this issue.
> > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > [...]
> > [7516987.983293] [  pid  ]   uid  tgid  total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [7516987.983510] [ 5740]      0  5740       257        1        32768        0          -998 pause
> > [7516987.983574] [58804]      0 58804      4594      771        81920        0          -998 entry_point.bas
> > [7516987.983577] [58908]      0 58908      7089      689        98304        0          -998 cron
> > [7516987.983580] [58910]      0 58910     16235     5576       163840        0          -998 supervisord
> > [7516987.983590] [59620]      0 59620     18074     1395       188416        0          -998 sshd
> > [7516987.983594] [59622]      0 59622     18680     6679       188416        0          -998 python
> > [7516987.983598] [59624]      0 59624   1859266     5161       548864        0          -998 odin-agent
> > [7516987.983600] [59625]      0 59625    707223     9248       983040        0          -998 filebeat
> > [7516987.983604] [59627]      0 59627    416433    64239       774144        0          -998 odin-log-agent
> > [7516987.983607] [59631]      0 59631    180671    15012       385024        0          -998 python3
> > [7516987.983612] [61396]      0 61396    791287     3189       352256        0          -998 client
> > [7516987.983615] [61641]      0 61641   1844642    29089       946176        0          -998 client
> > [7516987.983765] [ 9236]      0  9236      2642      467        53248        0          -998 php_scanner
> > [7516987.983911] [42898]      0 42898     15543      838       167936        0          -998 su
> > [7516987.983915] [42900]   1000 42900      3673      867        77824        0          -998 exec_script_vr2
> > [7516987.983918] [42925]   1000 42925     36475    19033       335872        0          -998 python
> > [7516987.983921] [57146]   1000 57146      3673      848        73728        0          -998 exec_script_J2p
> > [7516987.983925] [57195]   1000 57195    186359    22958       491520        0          -998 python2
> > [7516987.983928] [58376]   1000 58376    275764    14402       290816        0          -998 rosmaster
> > [7516987.983931] [58395]   1000 58395    155166     4449       245760        0          -998 rosout
> > [7516987.983935] [58406]   1000 58406  18285584  3967322     37101568        0          -998 data_sim
> > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >
> > We can find that the first scanned process, 5740 (pause), was killed,
> > but its rss is only one page. That is because, when we calculate the
> > oom badness in oom_badness(), we always ignore the negative points and
> > convert all of them to 1. Now, as the oom_score_adj of all the
> > processes in this targeted memcg has the same value of -998, the
> > points of these processes are all negative. As a result, the first
> > scanned process will be killed.
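For reference, the scoring that produces this behavior is roughly the
following - a simplified sketch of the pre-patch oom_badness(), with the
numbers from the log above filled in as comments; the real code in
mm/oom_kill.c has a few more checks (unkillable tasks, OOM_SCORE_ADJ_MIN,
MMF_OOM_SKIP, ...):

unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
{
        long points, adj;

        adj = (long)p->signal->oom_score_adj;   /* -998 for every task here */

        /* memory footprint in pages: rss + swap entries + page tables */
        points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
                 mm_pgtables_bytes(p->mm) / PAGE_SIZE;

        /* scale the bias to per-mille of the usable memory */
        adj *= totalpages / 1000;       /* -998 * (4194304 / 1000) = -4185612 */
        points += adj;

        /*
         * With a 16G limit anything smaller than ~99.8% of it goes negative:
         *   pause:          9 - 4185612 < 0 -> 1
         *   data_sim: 3976380 - 4185612 < 0 -> 1
         * so every eligible task scores 1 and the scan order picks the victim.
         */
        return points > 0 ? points : 1;
}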
> Such a large bias can skew results quite considerably.
>

Right. Please refer to the kubernetes doc [1] for more information about
this large bias.

[1] https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

> > The oom_score_adj (-998) in this memcg is set by kubelet, because it
> > is a Guaranteed pod, which has a higher priority, to prevent it from
> > being killed by a system oom.
>
> This is really interesting! I assume that the oom_score_adj is set to
> protect from the global oom situation, right?

Right. See also the kubernetes doc.

> I am struggling to understand what the expected behavior is when the
> oom is internal to such a group, though. Is killing a single task from
> such a group a sensible choice? I am not really familiar with kubelet,
> but can it cope with data_sim going away from under it while the rest
> would still run? Wouldn't it make more sense to simply tear down the
> whole thing?
>

There are two containers in one kubernetes pod. One is the
pause-container, which has only one process - pause - managing the netns;
the other is the docker-init-container, in which all the other processes
are running. Once the pause process is killed, the kubelet will rebuild
all the containers in this pod, while if one of the processes in the
docker-init-container is killed, the kubelet will only try to re-run that
process. So tearing down the whole thing is more costly than re-running a
single process. I'm not that familiar with kubernetes either; that is my
understanding.

> But that is a separate thing.

Right.

> > To fix this issue, we should make the calculation of the oom points
> > more accurate. We can achieve that by converting chosen_points from
> > 'unsigned long' to 'long'.
>
> oom_score has very coarse units because it maps all the consumed memory
> into a 0 - 1000 scale, so effectively per-mille of the usable memory.
> oom_score_adj acts on top of that as a bias. This is exported to the
> userspace and I do not think we can change that (see
> Documentation/filesystems/proc.rst), unfortunately.

In this doc, I only find that oom_score and oom_score_adj are exposed to
the userspace, while this patch only changes oom_control->chosen_points,
which is only used internally by the oom killer. So I don't see why we
can't change oom_control->chosen_points.
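To make that concrete, the direction of the change is roughly the
following - only a sketch, not the exact patch; the real version still
has to handle the existing special cases (OOM_SCORE_ADJ_MIN, MMF_OOM_SKIP,
and so on):

long oom_badness(struct task_struct *p, unsigned long totalpages)
{
        long points, adj;

        /* same footprint and bias calculation as before ... */
        points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
                 mm_pgtables_bytes(p->mm) / PAGE_SIZE;
        adj = (long)p->signal->oom_score_adj * (long)(totalpages / 1000);
        points += adj;

        /*
         * ... but keep the sign instead of clamping negative scores to 1,
         * so that data_sim (-209232) still compares higher than pause
         * (-4185603) when the oom killer picks the largest score.
         */
        return points;
}

oom_control->chosen_points would become a long as well, initialized to the
most negative value (e.g. LONG_MIN), so that comparing candidates in
select_bad_process() keeps working. The oom_score exported via procfs is
not touched.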
> So your patch cannot really be accepted as is, because it would start
> reporting values outside of the allowed range, unless I am doing some
> math incorrectly.
>

See above, my patch will not break the userspace at all.

> On the other hand, in this particular case I believe the existing
> calculation is just wrong. Usable memory is 16777216kB (4194304 pages);
> the top consumer is 3976380 pages, so 94.8%, while the lowest memory
> consumer is effectively 0%. Even if we discount 94.8% by 99.8% we should
> still be left with something like 7950 pages. So the normalization that
> oom_badness does cuts the results too aggressively. There was quite some
> churn in the calculation in the past fixing weird rounding bugs, so I
> have to think about how to fix this properly some more.
>
> That being said, even though the configuration is weird, I do agree
> that the oom_badness scaling is really unexpected and the memory
> consumption in this particular example should be quite telling about
> whom to choose as an oom victim.
> --
> Michal Hocko
> SUSE Labs

--
Thanks
Yafang