From: Yafang Shao
Date: Mon, 20 Apr 2020 17:58:03 +0800
Subject: Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM
In-Reply-To: <20200420091452.GJ27314@dhcp22.suse.cz>
References: <20200418151311.7397-1-laoar.shao@gmail.com> <20200418151311.7397-4-laoar.shao@gmail.com> <20200420081353.GI27314@dhcp22.suse.cz> <20200420091452.GJ27314@dhcp22.suse.cz>

On Mon, Apr 20, 2020 at 5:14 PM Michal Hocko wrote:
>
> On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> > On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko wrote:
> > >
> > > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> [...]
> > > > This patch is to improve it.
> > > > If no victim is found in memcg oom, we should force the current task
> > > > to wait until pages become available. That is similar to the behavior
> > > > in memcg v1 when oom_kill_disable is set.
> > >
> > > The primary reason why we force the charge is that we _cannot_ wait
> > > indefinitely in the charge path, because the current call chain might
> > > hold locks or other resources which could block a large part of the
> > > system. You are essentially reintroducing that behavior.
> > >
> >
> > It seems my poor English misled you?
> > The task is NOT waiting in the charge path; it is actually waiting
> > at the end of the page fault, so it doesn't hold any locks.
>
> How is that supposed to work? Sorry, I didn't really study your patch
> very closely, because it doesn't apply on the current Linus' tree and
> your previous 2 patches have reshuffled the code, so it is not really
> trivial to have a good picture of the overall logic change.
>

My patch is based on commit 8632e9b5645b, and I can rebase it for
easier review.

Here is the overall logic of the patch:

do_page_fault
  mem_cgroup_try_charge
    mem_cgroup_out_of_memory            <<< over the limit of this memcg
      out_of_memory
        if (!oc->chosen)                <<< no killable task found
          Set_an_OOM_state_in_the_task  <<< set the oom state
  mm_fault_error
    pagefault_out_of_memory             <<< VM_FAULT_OOM was returned by the error above
      mem_cgroup_oom_synchronize(true)
        Check_the_OOM_state_and_then_wait_here  <<< check the oom state
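
In (simplified) code, the idea looks roughly like this. This is only an
illustrative sketch, not the patch itself: the field
current->memcg_oom_no_victim is a made-up name for the per-task oom
state, while out_of_memory(), pagefault_out_of_memory() and
mem_cgroup_oom_synchronize() are the existing entry points.

  /* Charge path: the memcg is over its limit but no victim was found. */
  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg,
                                       gfp_t gfp_mask, int order)
  {
          struct oom_control oc = {
                  .memcg    = memcg,
                  .gfp_mask = gfp_mask,
                  .order    = order,
          };

          if (!out_of_memory(&oc)) {
                  /*
                   * No killable task: record the oom state on the task
                   * (hypothetical field) and fail the charge, so that
                   * VM_FAULT_OOM propagates out of the page fault.
                   */
                  current->memcg_oom_no_victim = memcg;
                  return false;
          }
          return true;
  }

  /* Page fault exit path: no locks are held here, so sleeping is safe. */
  void pagefault_out_of_memory(void)
  {
          /*
           * Check the state recorded above and, as with oom_kill_disable
           * in cgroup v1, wait on the memcg oom waitqueue until an
           * uncharge wakes the task up.
           */
          if (mem_cgroup_oom_synchronize(true))
                  return;

          /* otherwise fall back to the global oom handling */
  }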

> > See the comment above mem_cgroup_oom_synchronize()
>
> Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
> oom handling is disabled (aka handed over to userspace). All other
> paths should handle the oom in the charge path.

Right. Now this patch introduces another path that enters
mem_cgroup_oom_synchronize().

> Please have a look at
> 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
> for more background and motivation.
>

Before I sent this patch, I had read it carefully.

> mem_cgroup_oom_synchronize was a workaround for deadlocks, and the side
> effect was that all other charge paths outside of #PF were failing
> allocations prematurely, and that had an effect on user space.

I guess this side effect is caused by the precision of the page counter;
for example, the page counter isn't updated immediately after the pages
are uncharged. That's the issue we should improve, IMHO.

> > > Is the above example a real use case, or have you just tried a test
> > > case that would trigger the problem?
> >
> > On my server I found the memory usage of a container was greater than
> > its limit.
> > From dmesg I knew there were no killable tasks, because
> > oom_score_adj was set to -1000.
>
> I would really recommend addressing this problem in the userspace
> configuration, either by increasing the memory limit or by fixing the
> oom-disabled userspace to not consume that much memory.
>

This issue can be addressed in the userspace configuration. But note
that there are many containers running on a single host, and what we
should do is keep the isolation as strong as possible. If we don't take
any action in the kernel, the users will complain to us that their
services are easily affected by the weak isolation of the container.

> > Then I tried this test case to reproduce the issue.
> > This issue can be triggered by a misconfiguration of oom_score_adj,
> > and it can also be triggered by a memory leak in a task with
> > oom_score_adj -1000.
>
> Please note that there is not much the system can do about oom-disabled
> tasks that leak memory. Even the global case would slowly kill all
> other userspace until it panics due to no eligible tasks. The
> oom_score_adj has very strong consequences. Do not use it without very
> careful consideration.

global case    -> kill others until the system panics
container case -> kill others until no tasks can run in the container

I think this is consistent behavior.

Thanks
Yafang