From: Yafang Shao
Date: Thu, 9 Jul 2020 09:57:39 +0800
Subject: Re: [PATCH] mm, oom: make the calculation of oom badness more accurate
To: David Rientjes
Cc: Michal Hocko, Andrew Morton, Linux MM
References: <1594214649-9837-1-git-send-email-laoar.shao@gmail.com> <20200708142806.GJ7271@dhcp22.suse.cz> <20200708143211.GK7271@dhcp22.suse.cz>

On Thu, Jul 9, 2020 at 1:57 AM David Rientjes wrote:
>
> On Wed, 8 Jul 2020, Michal Hocko wrote:
>
> > I have only now realized that David is not on Cc. Add him here. The
> > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> >
> > I believe the main problem is that we are normalizing to oom_score_adj
> > units rather than usage/total. I have a very vague recollection this has
> > been done in the past but I didn't get to dig into details yet.
> >
>
> The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> page adjustment of:
>
>   adj = -998 * 4194304 / 1000 = −4185915 pages
>
> The largest pid 58406 (data_sim) has rss 3967322 pages,
> pgtables 37101568 / 4096 = 9058 pages, and swapents 0. So its unadjusted
> badness is
>
>   3967322 + 9058 pages = 3976380 pages
>
> Factoring in oom_score_adj, all of these processes will have a badness of
> 1 because oom_badness() doesn't underflow, which I think is the point of
> Yafang's proposal.
>

Right. Thanks for helping clarify it.

> I think the patch can work but, as you mention, also needs an update to
> proc_oom_score().
> proc_oom_score() is using the global amount of memory
> so Yafang is likely not seeing it go negative for that reason but it could
> happen.
>

I missed proc_oom_score(). I will think about how to correct it.

> > On Wed 08-07-20 16:28:08, Michal Hocko wrote:
> > > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > > Recently we found an issue on our production environment that when memcg
> > > > oom is triggered the oom killer doesn't choose the process with largest
> > > > resident memory but chose the first scanned process. Note that all
> > > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > > should choose the process with largest resident memory.
> > > >
> > > > Below is part of the oom info, which is enough to analyze this issue.
> > > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > > [...]
> > > > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > > [7516987.983510] [ 5740]     0  5740      257        1          32768        0          -998 pause
> > > > [7516987.983574] [58804]     0 58804     4594      771          81920        0          -998 entry_point.bas
> > > > [7516987.983577] [58908]     0 58908     7089      689          98304        0          -998 cron
> > > > [7516987.983580] [58910]     0 58910    16235     5576         163840        0          -998 supervisord
> > > > [7516987.983590] [59620]     0 59620    18074     1395         188416        0          -998 sshd
> > > > [7516987.983594] [59622]     0 59622    18680     6679         188416        0          -998 python
> > > > [7516987.983598] [59624]     0 59624  1859266     5161         548864        0          -998 odin-agent
> > > > [7516987.983600] [59625]     0 59625   707223     9248         983040        0          -998 filebeat
> > > > [7516987.983604] [59627]     0 59627   416433    64239         774144        0          -998 odin-log-agent
> > > > [7516987.983607] [59631]     0 59631   180671    15012         385024        0          -998 python3
> > > > [7516987.983612] [61396]     0 61396   791287     3189         352256        0          -998 client
> > > > [7516987.983615] [61641]     0 61641  1844642    29089         946176        0          -998 client
> > > > [7516987.983765] [ 9236]     0  9236     2642      467          53248        0          -998 php_scanner
> > > > [7516987.983911] [42898]     0 42898    15543      838         167936        0          -998 su
> > > > [7516987.983915] [42900]  1000 42900     3673      867          77824        0          -998 exec_script_vr2
> > > > [7516987.983918] [42925]  1000 42925    36475    19033         335872        0          -998 python
> > > > [7516987.983921] [57146]  1000 57146     3673      848          73728        0          -998 exec_script_J2p
> > > > [7516987.983925] [57195]  1000 57195   186359    22958         491520        0          -998 python2
> > > > [7516987.983928] [58376]  1000 58376   275764    14402         290816        0          -998 rosmaster
> > > > [7516987.983931] [58395]  1000 58395   155166     4449         245760        0          -998 rosout
> > > > [7516987.983935] [58406]  1000 58406 18285584  3967322       37101568        0          -998 data_sim
> > > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > >
> > > > We can find that the first scanned process 5740 (pause) was killed, but its
> > > > rss is only one page. That is because, when we calculate the oom badness in
> > > > oom_badness(), we always ignore the negative point and convert all of these
> > > > negative points to 1. Now as oom_score_adj of all the processes in this
> > > > targeted memcg have the same value -998, the points of these processes are
> > > > all negative. As a result, the first scanned process will be killed.
> > >
> > > Such a large bias can skew results quite considerably.
> > >
> > > > The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> > > > a Guaranteed pod, which has higher priority to prevent it from being killed
> > > > by system oom.
> > >
> > > This is really interesting! I assume that the oom_score_adj is set to
> > > protect from the global oom situation right? I am struggling to
> > > understand what is the expected behavior when the oom is internal for
> > > such a group though. Is killing a single task from such a group a
> > > sensible choice? I am not really familiar with kubelet but can it cope
> > > with data_sim going away from under it while the rest would still run?
> > > Wouldn't it make more sense to simply tear down the whole thing?
> > >
> > > But that is a separate thing.
> > >
> > > > To fix this issue, we should make the calculation of oom point more
> > > > accurate. We can achieve it by converting chosen_point from 'unsigned
> > > > long' to 'long'.
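
To make the failure concrete, here is a minimal Python sketch of the scoring discussed in this thread. It is a simplified model, not the kernel code and not the patch: the function names (page_adjustment, signed_points, clamped_points, pick_victim) are made up for illustration, the adjustment follows the arithmetic in David's mail, and ties are assumed to keep the first scanned task, which is what the log suggests. Only three rows of the task dump are used.

```python
PAGE_SIZE = 4096        # bytes per page, as implied by "37101568 / 4096"
TOTAL_PAGES = 4194304   # memcg limit of 16777216kB expressed in pages

# (name, rss pages, pgtables_bytes, swapents, oom_score_adj), in scan order
TASKS = [
    ("pause",       1,       32768,    0, -998),
    ("supervisord", 5576,    163840,   0, -998),
    ("data_sim",    3967322, 37101568, 0, -998),
]

def page_adjustment(oom_score_adj, totalpages):
    """Scale oom_score_adj to pages, truncating toward zero like C division."""
    scaled = abs(oom_score_adj) * totalpages // 1000
    return scaled if oom_score_adj >= 0 else -scaled

def signed_points(task, totalpages):
    """Badness that is allowed to go negative (the idea behind the proposal)."""
    _name, rss, pgtables_bytes, swapents, adj = task
    usage = rss + pgtables_bytes // PAGE_SIZE + swapents
    return usage + page_adjustment(adj, totalpages)

def clamped_points(task, totalpages):
    """Badness with every non-positive result collapsed to 1 (reported behaviour)."""
    return max(signed_points(task, totalpages), 1)

def pick_victim(tasks, totalpages, score):
    """Return the name of the highest-scoring task; ties keep the earlier task."""
    chosen, chosen_points = None, None
    for task in tasks:
        points = score(task, totalpages)
        if chosen_points is None or points > chosen_points:
            chosen, chosen_points = task[0], points
    return chosen

print(pick_victim(TASKS, TOTAL_PAGES, clamped_points))  # pause
print(pick_victim(TASKS, TOTAL_PAGES, signed_points))   # data_sim
```

With the clamped score every task ends up at 1, so pause (scanned first) is picked. Keeping the sign gives data_sim -209535 against -4185906 for pause, so the largest consumer is picked instead.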
> > >
> > > oom_score has very coarse units because it maps all the consumed
> > > memory into a 0 - 1000 scale, so effectively per-mille of the usable
> > > memory. oom_score_adj acts on top of that as a bias. This is
> > > exported to the userspace and I do not think we can change that (see
> > > Documentation/filesystems/proc.rst) unfortunately. So your patch cannot
> > > be really accepted as is because it would start reporting values outside
> > > of the allowed range unless I am doing some math incorrectly.
> > >
> > > On the other hand, in this particular case I believe the existing
> > > calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> > > the top consumer is 3976380 pages so 94.8%, while the lowest memory consumer
> > > is effectively 0%. Even if we discount 94.8% by 99.8% then we should be
> > > still having something like 7950 pages. So the normalization that oom_badness
> > > does cuts results too aggressively. There was quite some churn in the
> > > calculation in the past fixing weird rounding bugs so I have to think
> > > about how to fix this properly some more.
> > >
> > > That being said, even though the configuration is weird I do agree that
> > > oom_badness scaling is really unexpected and the memory consumption
> > > in this particular example should be quite telling about whom to choose as
> > > an oom victim.

--
Thanks
Yafang
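
Michal's estimate in the quoted text can be checked with a few lines of Python. This is only one reading of "discount 94.8% by 99.8%", namely keeping 2 per mille of the top consumer's pages. The variable names are illustrative and come from neither the kernel nor the patch.

```python
PAGE_KB = 4                      # 4 kB pages, implied by 16777216kB == 4194304 pages
limit_kb = 16777216              # memcg limit from the oom report
total_pages = limit_kb // PAGE_KB

# data_sim: rss pages plus page-table pages (pgtables_bytes / 4096)
top_consumer = 3967322 + 37101568 // 4096

share = top_consumer / total_pages                 # fraction of usable memory
discounted = top_consumer * (1000 - 998) // 1000   # what is left after a -998 bias

print(total_pages, top_consumer, round(share * 100, 1), discounted)
```

Under that reading the top consumer keeps 7952 pages, in line with the "something like 7950 pages" quoted above, whereas the clamping reports 1 for every task in the group.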