Date: Wed, 8 Jul 2020 18:09:26 +0200
From: Michal Hocko
To: Yafang Shao
Cc: David Rientjes, Andrew Morton, Linux MM
Subject: Re: [PATCH] mm, oom: make the calculation of oom badness more accurate
Message-ID: <20200708160926.GL7271@dhcp22.suse.cz>
References: <1594214649-9837-1-git-send-email-laoar.shao@gmail.com> <20200708142806.GJ7271@dhcp22.suse.cz>

On Wed 08-07-20 23:11:43, Yafang Shao wrote:
> On Wed, Jul 8, 2020 at 10:28 PM Michal Hocko wrote:
> >
> > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > Recently we found an issue on our production environment that when memcg
> > > oom is triggered the oom killer doesn't choose the process with the largest
> > > resident memory but chooses the first scanned process. Note that all
> > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > should choose the process with the largest resident memory.
> > >
> > > Below is part of the oom info, which is enough to analyze this issue.
> > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > [...]
> > > [7516987.983293] [ pid ]   uid  tgid total_vm     rss pgtables_bytes swapents oom_score_adj name
> > > [7516987.983510] [ 5740]     0  5740      257       1          32768        0          -998 pause
> > > [7516987.983574] [58804]     0 58804     4594     771          81920        0          -998 entry_point.bas
> > > [7516987.983577] [58908]     0 58908     7089     689          98304        0          -998 cron
> > > [7516987.983580] [58910]     0 58910    16235    5576         163840        0          -998 supervisord
> > > [7516987.983590] [59620]     0 59620    18074    1395         188416        0          -998 sshd
> > > [7516987.983594] [59622]     0 59622    18680    6679         188416        0          -998 python
> > > [7516987.983598] [59624]     0 59624  1859266    5161         548864        0          -998 odin-agent
> > > [7516987.983600] [59625]     0 59625   707223    9248         983040        0          -998 filebeat
> > > [7516987.983604] [59627]     0 59627   416433   64239         774144        0          -998 odin-log-agent
> > > [7516987.983607] [59631]     0 59631   180671   15012         385024        0          -998 python3
> > > [7516987.983612] [61396]     0 61396   791287    3189         352256        0          -998 client
> > > [7516987.983615] [61641]     0 61641  1844642   29089         946176        0          -998 client
> > > [7516987.983765] [ 9236]     0  9236     2642     467          53248        0          -998 php_scanner
> > > [7516987.983911] [42898]     0 42898    15543     838         167936        0          -998 su
> > > [7516987.983915] [42900]  1000 42900     3673     867          77824        0          -998 exec_script_vr2
> > > [7516987.983918] [42925]  1000 42925    36475   19033         335872        0          -998 python
> > > [7516987.983921] [57146]  1000 57146     3673     848          73728        0          -998 exec_script_J2p
> > > [7516987.983925] [57195]  1000 57195   186359   22958         491520        0          -998 python2
> > > [7516987.983928] [58376]  1000 58376   275764   14402         290816        0          -998 rosmaster
> > > [7516987.983931] [58395]  1000 58395   155166    4449         245760        0          -998 rosout
> > > [7516987.983935] [58406]  1000 58406 18285584 3967322       37101568        0          -998 data_sim
> > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > >
> > > We can see that the first scanned process 5740 (pause) was killed, but its
> > > rss is only one page. That is because, when we calculate the oom badness in
> > > oom_badness(), we always ignore negative points and convert all of these
> > > negative points to 1. Now, as all the processes in this targeted memcg have
> > > the same oom_score_adj value of -998, the points of these processes are
> > > all negative values. As a result, the first scanned process will be killed.
> >
> > Such a large bias can skew results quite considerably.
> >
>
> Right.
> Please refer to the kubernetes doc[1] for more information about this large bias.
>
> [1]. https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
>
> > > The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> > > a Guaranteed pod, which has higher priority so that it is protected from
> > > being killed by the system oom.
> >
> > This is really interesting! I assume that the oom_score_adj is set to
> > protect from the global oom situation, right?
>
> Right. See also the kubernetes doc.
>
> > I am struggling to
> > understand what the expected behavior is when the oom is internal to
> > such a group though. Is killing a single task from such a group a
> > sensible choice?
> > I am not really familiar with kubelet but can it cope
> > with data_sim going away from under it while the rest would still run?
> > Wouldn't it make more sense to simply tear down the whole thing?
> >
>
> There are two containers in one kubernetes pod. One is the
> pause-container, which has only one process, the pause, which
> manages the netns. The other is the docker-init-container, in
> which all other processes are running.
> Once the pause process is killed, the kubelet will rebuild all the
> containers in this pod, while if one of the processes in the
> docker-init-container is killed, the kubelet will try to re-run it.
> So tearing down the whole thing is more costly than only trying to
> re-run one process.
> I'm not very familiar with kubernetes either; this is just my understanding.

Thanks for the clarification!

[...]
> > oom_score has very coarse units because it maps all the consumed
> > memory onto a 0 - 1000 scale, so effectively per-mille of the usable
> > memory. oom_score_adj acts on top of that as a bias. This is
> > exported to the userspace and I do not think we can change that (see
> > Documentation/filesystems/proc.rst) unfortunately.
>
> In this doc, I only find that oom_score and oom_score_adj are exposed to
> the userspace.
> This patch only changes oom_control->chosen_points, which is
> only used internally by the oom killer.
> So I think we can change oom_control->chosen_points.

Unless I am misreading the patch, you are allowing negative values to be
returned from oom_badness(), and that is used by proc_oom_score(), which is
exported to the userspace.
-- 
Michal Hocko
SUSE Labs
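
For reference, the clamping described in the quoted report can be modelled
with a few lines of Python. This is only a rough sketch of the arithmetic as I
understand it, not the kernel code. The function names are made up for
illustration, the page size is assumed to be 4 KiB (consistent with rss=1 and
anon-rss:4kB in the log), and totalpages is assumed to be the memcg limit in
pages with any swap contribution ignored.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def raw_points(rss, swapents, pgtables_bytes, oom_score_adj, totalpages):
    """Unclamped badness: pages in use plus the oom_score_adj bias.

    Hypothetical helper for illustration; it mirrors the calculation
    discussed in the thread, where the bias is scaled by totalpages / 1000.
    """
    points = rss + swapents + pgtables_bytes // PAGE_SIZE
    adj = oom_score_adj * (totalpages // 1000)
    return points + adj


def clamped_points(rss, swapents, pgtables_bytes, oom_score_adj, totalpages):
    """Badness with every non-positive result mapped to 1, as reported."""
    points = raw_points(rss, swapents, pgtables_bytes, oom_score_adj, totalpages)
    return points if points > 0 else 1


# Assumption: totalpages is the 16777216kB memcg limit expressed in pages.
TOTALPAGES = 16777216 * 1024 // PAGE_SIZE

# (rss, swapents, pgtables_bytes, oom_score_adj) copied from the log above.
PAUSE = (1, 0, 32768, -998)
DATA_SIM = (3967322, 0, 37101568, -998)

if __name__ == "__main__":
    for name, task in (("pause", PAUSE), ("data_sim", DATA_SIM)):
        print(name, raw_points(*task, TOTALPAGES), clamped_points(*task, TOTALPAGES))
```

Under these assumptions the unclamped values are -4185603 for pause and
-209232 for data_sim, so data_sim is clearly the larger consumer, yet both
collapse to 1 after clamping and the choice then depends only on scan order.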