From: Yafang Shao
Date: Fri, 10 Jul 2020 22:10:57 +0800
Subject: Re: [PATCH v2] mm, oom: make the calculation of oom badness more accurate
To: Michal Hocko
Cc: David Rientjes, Andrew Morton, Linux MM
References: <1594309987-9919-1-git-send-email-laoar.shao@gmail.com> <20200710121022.GA3022@dhcp22.suse.cz>
In-Reply-To: <20200710121022.GA3022@dhcp22.suse.cz>

On Fri, Jul 10, 2020 at 8:10 PM Michal Hocko wrote:
>
> On Thu 09-07-20 11:53:07, Yafang Shao wrote:
> > Recently we found an issue in our production environment: when a memcg
> > oom is triggered, the oom killer doesn't choose the process with the
> > largest resident memory but chooses the first scanned process. Note that
> > all processes in this memcg have the same oom_score_adj, so the oom killer
> > should choose the process with the largest resident memory.
> >
> > Below is part of the oom info, which is enough to analyze this issue.
> > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > [...]
> > [7516987.983293] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [7516987.983510] [   5740]     0  5740      257        1          32768        0          -998 pause
> > [7516987.983574] [  58804]     0 58804     4594      771          81920        0          -998 entry_point.bas
> > [7516987.983577] [  58908]     0 58908     7089      689          98304        0          -998 cron
> > [7516987.983580] [  58910]     0 58910    16235     5576         163840        0          -998 supervisord
> > [7516987.983590] [  59620]     0 59620    18074     1395         188416        0          -998 sshd
> > [7516987.983594] [  59622]     0 59622    18680     6679         188416        0          -998 python
> > [7516987.983598] [  59624]     0 59624  1859266     5161         548864        0          -998 odin-agent
> > [7516987.983600] [  59625]     0 59625   707223     9248         983040        0          -998 filebeat
> > [7516987.983604] [  59627]     0 59627   416433    64239         774144        0          -998 odin-log-agent
> > [7516987.983607] [  59631]     0 59631   180671    15012         385024        0          -998 python3
> > [7516987.983612] [  61396]     0 61396   791287     3189         352256        0          -998 client
> > [7516987.983615] [  61641]     0 61641  1844642    29089         946176        0          -998 client
> > [7516987.983765] [   9236]     0  9236     2642      467          53248        0          -998 php_scanner
> > [7516987.983911] [  42898]     0 42898    15543      838         167936        0          -998 su
> > [7516987.983915] [  42900]  1000 42900     3673      867          77824        0          -998 exec_script_vr2
> > [7516987.983918] [  42925]  1000 42925    36475    19033         335872        0          -998 python
> > [7516987.983921] [  57146]  1000 57146     3673      848          73728        0          -998 exec_script_J2p
> > [7516987.983925] [  57195]  1000 57195   186359    22958         491520        0          -998 python2
> > [7516987.983928] [  58376]  1000 58376   275764    14402         290816        0          -998 rosmaster
> > [7516987.983931] [  58395]  1000 58395   155166     4449         245760        0          -998 rosout
> > [7516987.983935] [  58406]  1000 58406 18285584  3967322       37101568        0          -998 data_sim
> > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >
> > We can find that the first scanned process 5740 (pause) was killed, but its
> > rss is only one page. That is because, when we calculate the oom badness in
> > oom_badness(), we always ignore the negative point and convert all of these
> > negative points to 1. Now as oom_score_adj of all the processes in this
> > targeted memcg have the same value -998, the points of these processes are
> > all negative. As a result, the first scanned process will be killed.
> >
> > The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> > a Guaranteed pod, which has higher priority to prevent from being killed by
> > system oom.
> >
> > To fix this issue, we should make the calculation of oom point more
> > accurate. We can achieve it by converting chosen_points from 'unsigned
> > long' to 'long'.
> >
> > Signed-off-by: Yafang Shao
> > ---
> >  drivers/tty/sysrq.c |  1 +
> >  fs/proc/base.c      |  7 ++++++-
> >  include/linux/oom.h |  4 ++--
> >  mm/memcontrol.c     |  1 +
> >  mm/oom_kill.c       | 19 ++++++++-----------
> >  mm/page_alloc.c     |  1 +
> >  6 files changed, 19 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
> > index 7c95afa9..e83fd46 100644
> > --- a/drivers/tty/sysrq.c
> > +++ b/drivers/tty/sysrq.c
> > @@ -382,6 +382,7 @@ static void moom_callback(struct work_struct *ignored)
> >                 .memcg = NULL,
> >                 .gfp_mask = gfp_mask,
> >                 .order = -1,
> > +               .chosen_points = LONG_MIN,
>
> It would be better to do the initialization only once when we start
> evaluating tasks (select_bad_process).
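
A minimal userspace sketch of that suggestion, assuming a simplified task record. The names struct task_sketch, badness_sketch() and select_bad_sketch() are hypothetical and this is not the kernel code; swap entries and page-table pages are left out of the usage term. chosen_points is set to LONG_MIN once, before the scan, and candidates are compared as signed values, so tasks whose badness is negative are still ranked by memory usage.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task_sketch {
        int pid;
        long rss_pages;      /* resident pages only (simplification) */
        long oom_score_adj;  /* -1000 .. 1000 */
};

/* Signed badness: usage plus the adjustment scaled to totalpages. */
static long badness_sketch(const struct task_sketch *t, long totalpages)
{
        assert(totalpages > 0);
        if (t->oom_score_adj == -1000)  /* OOM_SCORE_ADJ_MIN: never a victim */
                return LONG_MIN;
        return t->rss_pages + t->oom_score_adj * (totalpages / 1000);
}

/* Returns the index of the chosen task, or -1 if no task is eligible. */
static int select_bad_sketch(const struct task_sketch *tasks, size_t n,
                             long totalpages)
{
        long chosen_points = LONG_MIN;  /* initialised once, before the scan */
        int chosen = -1;

        for (size_t i = 0; i < n; i++) {
                long points = badness_sketch(&tasks[i], totalpages);

                /* Skip ineligible tasks and tasks scoring below the current pick. */
                if (points == LONG_MIN || points < chosen_points)
                        continue;
                chosen_points = points;
                chosen = (int)i;
        }
        return chosen;
}
```

With pause (1 page), python (6679 pages) and data_sim (3967322 pages) from the log above, all at oom_score_adj -998 and totalpages = 4194304 (16777216kB in 4kB pages, an assumed page size), the sketch returns the index of data_sim rather than pause.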

I used to initialize it in constrained_alloc() in the previous version,
but I found that was not proper, so I moved the initialization into the
definition of each oom_control. select_bad_process() should be a better
choice. I will update it.

> >         };
> >
> >         mutex_lock(&oom_lock);
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index d86c0af..bf16406 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -551,8 +551,13 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> >  {
> >         unsigned long totalpages = totalram_pages() + total_swap_pages;
> >         unsigned long points = 0;
> > +       long badness;
> >
> > -       points = oom_badness(task, totalpages) * 1000 / totalpages;
> > +       badness = oom_badness(task, totalpages);
> > +       if (badness != LONG_MIN) {
> > +               /* Let's keep the range of points as [0, 2000]. */
> > +               points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
> > +       }
> >         seq_printf(m, "%lu\n", points);
>
> This doesn't really work for OOM_SCORE_ADJ_MIN cases because they
> will simply print LONG_MIN rather than 0.
>

The points variable has been initialized to 0:
        unsigned long points = 0;
So for OOM_SCORE_ADJ_MIN cases, it will print 0:
        seq_printf(m, "%lu\n", points);

But..

> So you want
>       /*
>        * Special case OOM_SCORE_ADJ_MIN for all others scale the
>        * badness value into [0, 2000] range which we have been
>        * exporting for a long time so userspace might depend on it
>        */

the comment is useful, I will update it with your comment. Thanks.

>       if (badness == LONG_MIN)
>               badness = 0;
>       else
>               points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3
>
> FTR. In my other email I was proposing to scale usage to the [-1000, 1000]
> range by
>       points = adj + usage * 1000/ totalpages
>
> this would make the math slightly easier to follow but then I've
> realized that this would be much less precise so what you have is
> better. Btw. we used to do that in the past until a7f638f999ff4
> which has changed that for this very reason.
>
> > @@ -107,7 +107,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >
> >  bool __oom_reap_task_mm(struct mm_struct *mm);
> >
> > -extern unsigned long oom_badness(struct task_struct *p,
> > +long oom_badness(struct task_struct *p,
> >                 unsigned long totalpages);
>
> This is not really necessary.
>
> With that being addressed, you can add
> Acked-by: Michal Hocko
>

Thanks for the review.

-- 
Thanks
Yafang
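
For reference, a minimal userspace sketch of the [0, 2000] scaling discussed for proc_oom_score() above. oom_score_sketch() is a hypothetical helper, not kernel code. It assumes a 64-bit long so that badness * 1000 does not overflow, and it treats LONG_MIN as the OOM_SCORE_ADJ_MIN marker, as in the patch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: map a signed badness value to the exported score.
 * badness / totalpages * 1000 lies in roughly [-1000, 2000], so adding
 * 1000 gives [0, 3000] and multiplying by 2/3 gives [0, 2000].
 */
static unsigned long oom_score_sketch(long badness, long totalpages)
{
        unsigned long points = 0;

        assert(totalpages > 0);
        if (badness != LONG_MIN) {
                /* Below -totalpages the result would be negative. */
                assert(badness >= -totalpages);
                points = (unsigned long)
                        ((1000 + badness * 1000 / totalpages) * 2 / 3);
        }
        return points;  /* stays 0 for the OOM_SCORE_ADJ_MIN marker */
}
```

With totalpages = 1000000, a badness of 0 maps to 666, a badness equal to totalpages maps to 1333, and the upper bound of 2 * totalpages maps to 2000.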