From: Yafang Shao <laoar.shao@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, Linux MM <linux-mm@kvack.org>
Subject: Re: [RFC PATCH] mm, oom: oom ratelimit auto tuning
Date: Tue, 14 Apr 2020 20:32:54 +0800
Message-ID: <CALOAHbDv+ZAgmGJP7GFzGcjKBZTPk9kYo63g173Nh+vn00qmwg@mail.gmail.com>
In-Reply-To: <20200414073911.GC4629@dhcp22.suse.cz>
On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> > Recently we find an issue that when OOM happens the server is almost
> > unresponsive for several minutes. That is caused by a slow serial set
> > with "console=ttyS1,19200". As the speed of this serial is too slow, it
> > will take almost 10 seconds to print a full OOM message into it. And
> > then all tasks allocating pages will be blocked as there is almost no
> > pages can be reclaimed. At that time, the memory pressure is around 90
> > for a long time. If we don't print the OOM messages into this serial,
> > a full OOM message only takes less than 1ms and the memory pressure is
> > less than 40.
>
> Which part of the oom report takes the most time? I would expect this to
> be the dump_tasks part which can be pretty large when there is a lot of
> eligible tasks to kill.
>
Yes, dump_tasks takes around 6s of the total 10s, show_mem takes
around 2s, and dump_stack takes around 0.8s.
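
As a rough cross-check of these timings against the line rate, here is a
minimal sketch. It assumes 8N1 framing (10 bits on the wire per byte); the
framing is an assumption, only the 19200 baud rate comes from the console=
setting. The helper names are made up for illustration. Under that
assumption the console moves about 1920 bytes per second, so a 10s report
corresponds to roughly 19200 bytes of text and the 6s spent in dump_tasks
to roughly 11520 bytes.

```c
#include <assert.h>

/*
 * Bits on the wire per byte with 8N1 framing: 1 start + 8 data + 1 stop.
 * The framing is an assumption; only the baud rate is known.
 */
#define BITS_PER_BYTE_8N1 10

/* Hypothetical helper: bytes per second a serial console can emit. */
static unsigned long serial_bytes_per_sec(unsigned long baud)
{
	assert(baud >= BITS_PER_BYTE_8N1);
	return baud / BITS_PER_BYTE_8N1;
}

/* Hypothetical helper: milliseconds needed to push nbytes through the line. */
static unsigned long serial_tx_ms(unsigned long nbytes, unsigned long baud)
{
	return nbytes * 1000UL / serial_bytes_per_sec(baud);
}
```
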
> > We can avoid printing OOM messages into slow serial by adjusting
> > /proc/sys/kernel/printk to fix this issue, but then all messages with
> > KERN_WARNING level can't be printed into it neither, that may loss some
> > useful messages when we want to collect messages from the it for
> > debugging purpose.
>
> A large part of the oom report is printed with KERN_INFO log level. So
> you can reduce a large part of the output while not losing other
> potentially important information.
>
Reducing the KERN_INFO output can save a lot of time, but I am worried
that users sometimes need the full log, and if they can't find these
messages they may complain.
> > So it is better to decrease the ratelimit. We can introduce some sysctl
> > knobes similar with printk_ratelimit and burst, but it will burden the
> > amdin. Let the kernel automatically adjust the ratelimit, that would be
> > a better choice.
>
> No new knobs for ratelimiting. Admin shouldn't really care about these
> things.
Agreed.
[snip]
> Besides that I strongly suspect that you would be much better of
> by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> of output a lot. Or do you really require this information when
> debugging oom reports?
>
Yes, disabling /proc/sys/vm/oom_dump_tasks can save a lot of time.
But I'm not sure whether we can disable it completely, because disabling
it would also prevent the task list from being written into
/var/log/messages.
> > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > if the speed of the console is rapid and decrease rapidly if the speed
> > of the console is slow. oom_rs.burst will be in [1, 10] and
> > oom_rs.interval will always greater than 5 * HZ.
>
> I am not against increasing the ratelimit timeout. But this patch seems
> to be trying to be too clever. Why cannot we simply increase the
> parameters of the ratelimit?
I was just worried that users may complain if too many
oom_kill_process() reports are suppressed.
But considering that OOMs bursting at the same time always have the
same cause, I think one snapshot of the OOM may be enough.
Simply setting oom_rs to {20 * HZ, 1} can resolve this issue.
> I am also interested whether this actually
> works. AFAIR ratelimit doesn't really work reliably when the ratelimited
> operation takes a long time because the internals have no way to see
> when the operation finished.
>
I agree that ratelimit() is not very reliable.
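
To illustrate the concern, here is a simplified userspace model of a
window-based ratelimit. It is a sketch, not the kernel's lib/ratelimit.c:
locking and the missed-callback accounting are left out, and the names
rl_model, rl_model_allow and rl_model_run are made up. The point it shows
is that the window opens when a call is allowed, not when the ratelimited
operation finishes, so a report that takes longer than the interval is
never suppressed.

```c
#include <assert.h>

/* Simplified model of a window-based ratelimit (not the kernel code). */
struct rl_model {
	unsigned long interval;	/* window length, in ticks */
	unsigned int burst;	/* calls allowed per window */
	unsigned long begin;	/* tick at which the current window opened */
	unsigned int printed;	/* calls allowed so far in this window */
	int started;		/* has any window been opened yet? */
};

/* Returns 1 if the caller may proceed at tick `now`, 0 if suppressed. */
static int rl_model_allow(struct rl_model *rs, unsigned long now)
{
	assert(rs->burst > 0);
	/* Open a new window on first use or once the old one has expired. */
	if (!rs->started || now - rs->begin > rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->started = 1;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	return 0;
}

/*
 * Hypothetical driver: `events` back-to-back OOM kills. An allowed report
 * costs `report_ticks`, a suppressed one costs a single tick. Returns the
 * number of reports that were actually printed.
 */
static unsigned int rl_model_run(struct rl_model *rs, unsigned int events,
				 unsigned long report_ticks)
{
	unsigned long now = 0;
	unsigned int i, reports = 0;

	for (i = 0; i < events; i++) {
		if (rl_model_allow(rs, now)) {
			now += report_ticks;
			reports++;
		} else {
			now += 1;
		}
	}
	return reports;
}
```

With an assumed 100 ticks per second, a 10s report under a {5s, 10}
setting passes on every call in this model, because each report outlasts
the window. Under {20s, 1} the same burst of 20 kills prints one report.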
> > mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 48 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dfc357614e56..23dba8ccf313 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > {
> > struct task_struct *victim = oc->chosen;
> > struct mem_cgroup *oom_group;
> > - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > - DEFAULT_RATELIMIT_BURST);
> > + static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> > + int delta;
> > + unsigned long start;
> > + unsigned long end;
> >
> > /*
> > * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > }
> > task_unlock(victim);
> >
> > - if (__ratelimit(&oom_rs))
> > + if (__ratelimit(&oom_rs)) {
> > + start = jiffies;
> > dump_header(oc, victim);
> > + end = jiffies;
> > + delta = end - start;
> > +
> > + /*
> > + * The OOM messages may be printed to a serial with very low
> > + * speed, e.g. console=ttyS1,19200. It will take long
> > + * time to print these OOM messages to this serial, and
> > + * then processes allocating pages will all be blocked due
> > + * to it can hardly reclaim pages. That will case high
> > + * memory pressure and the system may be unresponsive for a
> > + * long time.
> > + * In this case, we should decrease the OOM ratelimit or
> > + * avoid printing OOM messages into the slow serial. But if
> > + * we avoid printing OOM messages into the slow serial, all
> > + * messages with KERN_WARNING level can't be printed into
> > + * it neither, that may loss some useful messages when we
> > + * want to collect messages from the console for debugging
> > + * purpose. So it is better to decrease the ratelimit. We
> > + * can introduce some sysctl knobes similar with
> > + * printk_ratelimit and burst, but it will burden the
> > + * admin. Let the kernel automatically adjust the ratelimit
> > + * would be a better chioce.
> > + * In bellow algorithm, it will decrease the OOM ratelimit
> > + * rapidly if the console is slow and increase the OOM
> > + * ratelimit slowly if the console is fast. oom_rs.burst
> > + * will be in [1, 10] and oom_rs.interval will always
> > + * greater than 5 * HZ.
> > + */
> > + if (delta < oom_rs.interval / 10) {
> > + if (oom_rs.interval >= 10 * HZ)
> > + oom_rs.interval /= 2;
> > + else if (oom_rs.interval > 6 * HZ)
> > + oom_rs.interval -= HZ;
> > +
> > + if (oom_rs.burst < 10)
> > + oom_rs.burst += 1;
> > + } else if (oom_rs.burst > 1) {
> > + oom_rs.burst = 1;
> > + oom_rs.interval = 4 * delta;
> > + }
> > +
> > + }
> >
> > /*
> > * Do we need to kill the entire memory cgroup?
> > --
> > 2.18.2
>
> --
Thanks
Yafang
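
For reference, the adjustment step from the quoted hunk can be exercised
in userspace with the sketch below. It mirrors the branches of the patch
as quoted; HZ_MODEL, struct oom_rl and oom_rl_adjust are names made up
for this illustration, and the tick rate of 1000 is an assumption.

```c
#include <assert.h>

#define HZ_MODEL 1000	/* assumed tick rate, for illustration only */

/* Stand-in for the two ratelimit fields the patch tunes. */
struct oom_rl {
	int interval;	/* window length, in ticks */
	int burst;	/* reports allowed per window */
};

/*
 * Mirrors the adjustment step of the quoted hunk. `delta` is how long
 * dump_header() took, in ticks.
 */
static void oom_rl_adjust(struct oom_rl *rs, int delta)
{
	assert(delta >= 0);
	if (delta < rs->interval / 10) {
		/* Fast console: shorten the window and allow more reports. */
		if (rs->interval >= 10 * HZ_MODEL)
			rs->interval /= 2;
		else if (rs->interval > 6 * HZ_MODEL)
			rs->interval -= HZ_MODEL;

		if (rs->burst < 10)
			rs->burst += 1;
	} else if (rs->burst > 1) {
		/* Slow console: fall back to one report per 4 * delta. */
		rs->burst = 1;
		rs->interval = 4 * delta;
	}
}
```

Starting from {20 * HZ_MODEL, 1}, two fast reports bring the state to
{5 * HZ_MODEL, 3}; a following 10s report resets it to
{40 * HZ_MODEL, 1}.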