linux-mm.kvack.org archive mirror
From: Yafang Shao <laoar.shao@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v2] mm, oom: make the calculation of oom badness more accurate
Date: Fri, 10 Jul 2020 22:10:57 +0800	[thread overview]
Message-ID: <CALOAHbB19oeXo+WGu5B93uy1UW_+1phhAse6_uK7hfLiX68c0w@mail.gmail.com> (raw)
In-Reply-To: <20200710121022.GA3022@dhcp22.suse.cz>

On Fri, Jul 10, 2020 at 8:10 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 09-07-20 11:53:07, Yafang Shao wrote:
> > Recently we found an issue in our production environment: when memcg
> > oom is triggered, the oom killer doesn't choose the process with the
> > largest resident memory but instead chooses the first scanned process.
> > Note that all processes in this memcg have the same oom_score_adj, so
> > the oom killer should choose the process with the largest resident memory.
> >
> > Below is part of the oom info, which is enough to analyze this issue.
> > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > [...]
> > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >
> > We can see that the first scanned process, 5740 (pause), was killed even
> > though its rss is only one page. That is because, when we calculate the
> > oom badness in oom_badness(), we always ignore negative points and
> > convert all of them to 1. As the oom_score_adj of every process in this
> > targeted memcg has the same value of -998, the points of these processes
> > are all negative, and as a result the first scanned process is killed.
> >
> > The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> > a Guaranteed pod, which has a higher priority to prevent it from being
> > killed by a system oom.
> >
> > To fix this issue, we should make the calculation of oom points more
> > accurate. We can achieve that by converting chosen_points from
> > 'unsigned long' to 'long'.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  drivers/tty/sysrq.c |  1 +
> >  fs/proc/base.c      |  7 ++++++-
> >  include/linux/oom.h |  4 ++--
> >  mm/memcontrol.c     |  1 +
> >  mm/oom_kill.c       | 19 ++++++++-----------
> >  mm/page_alloc.c     |  1 +
> >  6 files changed, 19 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
> > index 7c95afa9..e83fd46 100644
> > --- a/drivers/tty/sysrq.c
> > +++ b/drivers/tty/sysrq.c
> > @@ -382,6 +382,7 @@ static void moom_callback(struct work_struct *ignored)
> >               .memcg = NULL,
> >               .gfp_mask = gfp_mask,
> >               .order = -1,
> > +             .chosen_points = LONG_MIN,
>
> It would be better to do the initialization only once when we start
> evaluating tasks (select_bad_process).
>

I initialized it in constrained_alloc() in the previous version, but I
found that wasn't proper, so I moved the initialization into the
definitions of each oom_control. select_bad_process() would be a better
choice; I will update it.
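
Something like the following, i.e. setting it once right before the tasks
are evaluated (a rough sketch against mm/oom_kill.c as of this patch,
untested):

    static void select_bad_process(struct oom_control *oc)
    {
            /* Start from the lowest possible score rather than 0. */
            oc->chosen_points = LONG_MIN;

            if (is_memcg_oom(oc))
                    mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
            else {
                    struct task_struct *p;

                    rcu_read_lock();
                    for_each_process(p)
                            if (oom_evaluate_task(p, oc))
                                    break;
                    rcu_read_unlock();
            }
    }

Then none of the oom_control definitions need to carry the initializer.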

> >       };
> >
> >       mutex_lock(&oom_lock);
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index d86c0af..bf16406 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -551,8 +551,13 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> >  {
> >       unsigned long totalpages = totalram_pages() + total_swap_pages;
> >       unsigned long points = 0;
> > +     long badness;
> >
> > -     points = oom_badness(task, totalpages) * 1000 / totalpages;
> > +     badness = oom_badness(task, totalpages);
> > +     if (badness != LONG_MIN) {
> > +             /* Let's keep the range of points as [0, 2000]. */
> > +             points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
> > +     }
> >       seq_printf(m, "%lu\n", points);
>
> This doesn't really work for OOM_SCORE_ADJ_MIN cases because they
> will simply print LONG_MIN rather than 0.
>

The points variable has already been initialized to 0:
unsigned long points = 0;

So for OOM_SCORE_ADJ_MIN cases, where oom_badness() returns LONG_MIN and
the assignment is skipped, it will print 0:
seq_printf(m, "%lu\n", points);

But..

> So you want
>         /*
>          * Special case OOM_SCORE_ADJ_MIN; for all others scale the
>          * badness value into the [0, 2000] range which we have been
>          * exporting for a long time so userspace might depend on it.
>          */

The comment is useful; I will include it in the updated version.
Thanks.

>         if (badness == LONG_MIN)
>                 badness = 0;
>         else
>                 points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
>
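
Folding your comment and the special case into this patch, proc_oom_score()
would then look roughly like the below (a sketch, untested):

    static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
                              struct pid *pid, struct task_struct *task)
    {
            unsigned long totalpages = totalram_pages() + total_swap_pages;
            unsigned long points = 0;
            long badness = oom_badness(task, totalpages);

            /*
             * Special case OOM_SCORE_ADJ_MIN; for all others scale the
             * badness value into the [0, 2000] range which we have been
             * exporting for a long time so userspace might depend on it.
             */
            if (badness != LONG_MIN)
                    points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;

            seq_printf(m, "%lu\n", points);
            return 0;
    }

Since oom_badness() now ranges from -totalpages (oom_score_adj = -1000 and
no usage) up to about 2 * totalpages (oom_score_adj = 1000 and full usage),
this maps the score onto [0, 2000] as before.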
> FTR. In my other email I was proposing to scale the usage to the
> [-1000, 1000] range by
>         points = adj + usage * 1000 / totalpages
>
> This would make the math slightly easier to follow, but then I
> realized that it would be much less precise, so what you have is
> better. Btw. we used to do that in the past until a7f638f999ff4,
> which changed it for this very reason.
>
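
Right. The precision difference is easy to see with made-up numbers: with
totalpages = 1048576 (4GiB of 4KiB pages), two tasks with rss of 524288
and 525000 pages both truncate to usage * 1000 / totalpages = 500, so the
pre-scaled form can no longer tell them apart, while comparing the raw
badness in pages still can.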
> > @@ -107,7 +107,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >
> >  bool __oom_reap_task_mm(struct mm_struct *mm);
> >
> > -extern unsigned long oom_badness(struct task_struct *p,
> > +long oom_badness(struct task_struct *p,
> >               unsigned long totalpages);
>
> This is not really necessary.
>
> With that being addressed, you can add
> Acked-by: Michal Hocko <mhocko@suse.com>
>

Thanks for the review.


-- 
Thanks
Yafang

