From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <1586597774-6831-1-git-send-email-laoar.shao@gmail.com> <20200414073911.GC4629@dhcp22.suse.cz>
In-Reply-To: <20200414073911.GC4629@dhcp22.suse.cz>
From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 14 Apr 2020 20:32:54 +0800
Message-ID: <CALOAHbDv+ZAgmGJP7GFzGcjKBZTPk9kYo63g173Nh+vn00qmwg@mail.gmail.com>
Subject: Re: [RFC PATCH] mm, oom: oom ratelimit auto tuning
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, Linux MM <linux-mm@kvack.org>
Content-Type: text/plain; charset="UTF-8"
List-ID: <linux-mm.kvack.org>

On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> > Recently we found an issue where the server is almost unresponsive
> > for several minutes when an OOM happens. It is caused by a slow serial
> > console set with "console=ttyS1,19200". As this serial line is so slow,
> > it takes almost 10 seconds to print a full OOM message to it. During
> > that time all tasks allocating pages are blocked, as almost no pages
> > can be reclaimed. The memory pressure stays around 90 for a long time.
> > If we don't print the OOM messages to this serial console, a full OOM
> > message takes less than 1ms and the memory pressure is below 40.
>
> Which part of the oom report takes the most time? I would expect this to
> be the dump_tasks part which can be pretty large when there is a lot of
> eligible tasks to kill.
>

Yes, dump_tasks takes around 6s of the total 10s, show_mem takes
around 2s, and dump_stack takes around 0.8s.
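
As a rough cross-check of these timings, here is a small user-space
sketch that converts time on the wire into bytes of log text. It assumes
8N1 framing (10 bits on the wire per byte), which is not something I
measured, and the helper name serial_bytes is made up for illustration.

```c
#include <stdio.h>

/* Assumption (not measured): 8N1 framing, so every byte of log text
 * costs 10 bits on the wire. */
#define WIRE_BITS_PER_BYTE 10UL

/* Bytes a serial console can emit in `msecs` milliseconds at `baud`. */
unsigned long serial_bytes(unsigned long baud, unsigned long msecs)
{
	return baud / WIRE_BITS_PER_BYTE * msecs / 1000UL;
}

/* Print the approximate size of each part of the report, using the
 * timings quoted above (6s, 2s, 0.8s, 10s in total). */
void print_breakdown(unsigned long baud)
{
	printf("dump_tasks: ~%lu bytes\n", serial_bytes(baud, 6000));
	printf("show_mem:   ~%lu bytes\n", serial_bytes(baud, 2000));
	printf("dump_stack: ~%lu bytes\n", serial_bytes(baud, 800));
	printf("total:      ~%lu bytes\n", serial_bytes(baud, 10000));
}
```

Under that assumption, print_breakdown(19200) gives about 11520 bytes
for dump_tasks and about 19200 bytes for the whole report.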

> > We can avoid printing OOM messages to the slow serial console by
> > adjusting /proc/sys/kernel/printk, but then no message at KERN_WARNING
> > level can be printed to it either, so we may lose some useful messages
> > when we want to collect messages from it for debugging purposes.
>
> A large part of the oom report is printed with KERN_INFO log level. So
> you can reduce a large part of the output while not losing other
> potentially important information.
>

Reducing the KERN_INFO output would save a lot of time, but I am worried
that users sometimes need the full log, and they may complain if they
can't find it.
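
To make the log-level point concrete, here is a small user-space sketch
(not kernel code) of how console_loglevel, the first field of
/proc/sys/kernel/printk, filters messages. The helper names
parse_console_loglevel and reaches_console are made up for illustration;
the numeric levels are the kernel's LOGLEVEL_WARNING and LOGLEVEL_INFO.

```c
#include <stdio.h>

/* Kernel log levels, as in include/linux/kern_levels.h. */
#define LOGLEVEL_WARNING 4
#define LOGLEVEL_INFO    6

/* The first field of /proc/sys/kernel/printk is console_loglevel.
 * Returns -1 if the string cannot be parsed. */
int parse_console_loglevel(const char *printk_sysctl)
{
	int level;

	if (sscanf(printk_sysctl, "%d", &level) != 1)
		return -1;
	return level;
}

/* A message is shown on the console only if its level is numerically
 * lower than console_loglevel. */
int reaches_console(int msg_level, int console_loglevel)
{
	return msg_level < console_loglevel;
}
```

With console_loglevel set to 5 or 6, reaches_console() is true for
LOGLEVEL_WARNING and false for LOGLEVEL_INFO, so the KERN_INFO part of
the OOM report would be kept off the slow console while warnings still
reach it.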

> > So it is better to decrease the ratelimit. We could introduce some
> > sysctl knobs similar to printk_ratelimit and burst, but that would
> > burden the admin. Letting the kernel adjust the ratelimit
> > automatically would be a better choice.
>
> No new knobs for ratelimiting. Admin shouldn't really care about these
> things.

Agreed.

[snip]
> Besides that I strongly suspect that you would be much better off
> by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> of output a lot. Or do you really require this information when
> debugging oom reports?
>

Yes, disabling /proc/sys/vm/oom_dump_tasks can save lots of time.
But I'm not sure whether we can disable it completely, because disabling
it would also prevent the task list from being written to
/var/log/messages.

> > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > if the console is fast and decrease rapidly if the console is slow.
> > oom_rs.burst will be in [1, 10] and oom_rs.interval will always be
> > greater than 5 * HZ.
>
> I am not against increasing the ratelimit timeout. But this patch seems
> to be trying to be too clever.  Why cannot we simply increase the
> parameters of the ratelimit?

I was just worried that users may complain if the reports from too many
oom_kill_process calls are suppressed.
But a burst of OOMs at the same time is almost always due to the same
cause, so I think one snapshot of the OOM may be enough.
Simply setting oom_rs to {20 * HZ, 1} can resolve this issue.

> I am also interested whether this actually
> works. AFAIR ratelimit doesn't really work reliably when the ratelimited
> operation takes a long time because the internals have no way to see
> when the operation finished.
>

I agree that ratelimit() is not very reliable in this case.
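
To illustrate why, here is a minimal user-space sketch of an
interval/burst ratelimit whose window is measured from when a window
begins, not from when the rate-limited operation finishes. It is a
simplified model, not the kernel implementation: the names rl_model,
rl_allow and busy_seconds are hypothetical, times are in seconds rather
than jiffies, and it assumes a new report is attempted as soon as the
previous one finishes (or one second after a suppressed attempt).

```c
#include <stdbool.h>

/* Simplified model of an interval/burst ratelimit state. */
struct rl_model {
	long interval;	/* window length, in seconds */
	int burst;	/* reports allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* reports emitted in the current window */
	bool started;	/* has the first window been opened? */
};

/* Returns true if a report starting at time `now` is allowed. */
bool rl_allow(struct rl_model *rs, long now)
{
	if (!rs->started) {
		rs->begin = now;
		rs->started = true;
	}
	/* The window expires relative to its start, regardless of how
	 * long the previous report took to print. */
	if (now > rs->begin + rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

/* Seconds spent printing reports during `horizon` seconds, when each
 * report takes `cost` seconds on the console. */
long busy_seconds(long interval, int burst, long cost, long horizon)
{
	struct rl_model rs = { .interval = interval, .burst = burst };
	long busy = 0;
	long t = 0;

	while (t < horizon) {
		if (rl_allow(&rs, t)) {
			busy += cost;
			t += cost;	/* console is busy for `cost` seconds */
		} else {
			t += 1;		/* suppressed, try again a second later */
		}
	}
	return busy;
}
```

Under these assumptions, with a 10 second report the default {5, 10}
never suppresses anything: busy_seconds(5, 10, 10, 120) is 120. With
{20, 1}, busy_seconds(20, 1, 10, 120) is 60, so the console is still
busy about half of the time, but the other half is left for reclaim.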

> >  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 48 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dfc357614e56..23dba8ccf313 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  {
> >       struct task_struct *victim = oc->chosen;
> >       struct mem_cgroup *oom_group;
> > -     static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > -                                           DEFAULT_RATELIMIT_BURST);
> > +     static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> > +     int delta;
> > +     unsigned long start;
> > +     unsigned long end;
> >
> >       /*
> >        * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >       }
> >       task_unlock(victim);
> >
> > -     if (__ratelimit(&oom_rs))
> > +     if (__ratelimit(&oom_rs)) {
> > +             start = jiffies;
> >               dump_header(oc, victim);
> > +             end = jiffies;
> > +             delta = end - start;
> > +
> > +             /*
> > +              * The OOM messages may be printed to a serial console with
> > +              * very low speed, e.g. console=ttyS1,19200. It takes a long
> > +              * time to print the OOM messages to such a console, and
> > +              * processes allocating pages will all be blocked because
> > +              * pages can hardly be reclaimed. That will cause high
> > +              * memory pressure and the system may be unresponsive for a
> > +              * long time.
> > +              * In this case, we should decrease the OOM ratelimit or
> > +              * avoid printing OOM messages to the slow console. But if
> > +              * we avoid printing OOM messages to the slow console, no
> > +              * message at KERN_WARNING level can be printed to it
> > +              * either, so we may lose some useful messages when we
> > +              * want to collect messages from the console for debugging
> > +              * purposes. So it is better to decrease the ratelimit. We
> > +              * could introduce some sysctl knobs similar to
> > +              * printk_ratelimit and burst, but that would burden the
> > +              * admin. Letting the kernel adjust the ratelimit
> > +              * automatically is a better choice.
> > +              * In the algorithm below, the OOM ratelimit is decreased
> > +              * rapidly if the console is slow and increased slowly if
> > +              * the console is fast. oom_rs.burst will be in [1, 10]
> > +              * and oom_rs.interval will always be greater than
> > +              * 5 * HZ.
> > +              */
> > +             if (delta < oom_rs.interval / 10) {
> > +                     if (oom_rs.interval >= 10 * HZ)
> > +                             oom_rs.interval /= 2;
> > +                     else if (oom_rs.interval > 6 * HZ)
> > +                             oom_rs.interval -= HZ;
> > +
> > +                     if (oom_rs.burst < 10)
> > +                             oom_rs.burst += 1;
> > +             } else if (oom_rs.burst > 1) {
> > +                     oom_rs.burst = 1;
> > +                     oom_rs.interval = 4 * delta;
> > +             }
> > +
> > +     }
> >
> >       /*
> >        * Do we need to kill the entire memory cgroup?
> > --
> > 2.18.2
>
> --
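
For reference, the adjustment rule in the quoted hunk can be modeled in
user space as below. This is only a sketch for experimenting with the
rule: the struct and function names are hypothetical, and HZ is fixed at
1000 purely as an example value.

```c
/* Assumption: HZ is a kernel build-time constant; 1000 is an example. */
#define HZ 1000

/* The two ratelimit fields the quoted hunk modifies. */
struct oom_rs_model {
	int interval;	/* in jiffies */
	int burst;
};

/* Apply the adjustment after a dump_header() call that took `delta`
 * jiffies, following the branches of the quoted hunk. */
void oom_rs_adjust(struct oom_rs_model *rs, int delta)
{
	if (delta < rs->interval / 10) {
		/* Fast console: shorten the interval, allow more reports. */
		if (rs->interval >= 10 * HZ)
			rs->interval /= 2;
		else if (rs->interval > 6 * HZ)
			rs->interval -= HZ;

		if (rs->burst < 10)
			rs->burst += 1;
	} else if (rs->burst > 1) {
		/* Slow console: back off to one report per 4 * delta. */
		rs->burst = 1;
		rs->interval = 4 * delta;
	}
}
```

If this model matches the hunk, then starting from {20 * HZ, 1}, two
fast reports bring the state to {5 * HZ, 3}, and a following report that
takes 0.6 * HZ sets the interval to 2.4 * HZ, which is below the 5 * HZ
mentioned in the description. A slow first report leaves {20 * HZ, 1}
unchanged, because the back-off branch only runs when burst > 1.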


Thanks
Yafang