From: Yafang Shao <laoar.shao@gmail.com>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
Date: Wed, 15 Jul 2020 11:10:07 +0800
Message-ID: <CALOAHbB3wHgUPVJvg6trwWpNzeM=atgvoJ4wzih0g0DFdmStYw@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.23.453.2007141934320.2615810@chino.kir.corp.google.com>
On Wed, Jul 15, 2020 at 10:44 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 1962232..15e0e18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  		.gfp_mask = gfp_mask,
> > > >  		.order = order,
> > > >  	};
> > > > -	bool ret;
> > > > +	bool ret = true;
> > > >  
> > > >  	if (mutex_lock_killable(&oom_lock))
> > > >  		return true;
> > > > +
> > > > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +		goto unlock;
> > > > +
> > > >  	/*
> > > >  	 * A few threads which were not waiting at mutex_lock_killable() can
> > > >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> > > >  	 */
> > > >  	ret = should_force_charge() || out_of_memory(&oc);
> > > > +
> > > > +unlock:
> > > >  	mutex_unlock(&oom_lock);
> > > >  	return ret;
> > > >  }
> > >
> > > Hi Yafang,
> > >
> > > We've run with a patch very much like this for several years and it works
> > > quite successfully to prevent the unnecessary oom killing of processes.
> > >
> > > We do this in out_of_memory() directly, however, because we found that we
> > > could prevent even *more* unnecessary killing by checking this at the
> > > "point of no return": the selection of processes takes some additional
> > > time, during which the oom condition may resolve itself.
> > >
> >
> > Hi David,
> >
> > Your proposal could also resolve the issue,
>
> It has successfully resolved it for several years in our kernel. We tried
> an approach similar to yours but saw many instances where memcg oom kills
> continued to proceed even though the memcg information dumped to the
> kernel log showed memory available.
>
> If this were a page or two that became available due to memory freeing,
> it would not make a significant difference. But if this races with an oom
> notification and a process exiting or being SIGKILL'd, it becomes much
> harder to explain to a user why their process was oom killed when there
> are tens of megabytes of memory available as shown by the kernel log (the
> freeing/exiting happened during a particularly long iteration over the
> processes attached to the memcg, for example).
>
> That's what motivated the change to do this in out_of_memory() directly:
> we found that it prevented even more unnecessary oom kills, which is a
> very good thing. It may only be easily observable, and make a significant
> difference, at very large scale, however.
>
Thanks for the clarification.
If it is this race that causes the issue and we want to narrow the race
window, I'm not sure whether it is better to check the memcg margin in
out_of_memory() or right before calling do_send_sig_info(). Per my
understanding, dump_header() always takes much more time than
select_bad_process(), especially when there are slow consoles, so the
race is most likely to happen while dump_header() and the other oom
information are being printed. But if we only check the memcg margin
after dumping all of that oom info, it would look strange to emit so
many oom logs without killing any process.
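
For the do_send_sig_info() option, something like the following untested
sketch is what I mean. The context lines are approximate (based on a
v5.8-era mm/oom_kill.c), and it assumes mem_cgroup_margin() is exported
as in your patch, so please take it as an illustration of the placement
rather than a real patch; note the check runs after dump_header(), which
is exactly the oddity I mentioned above:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ ... @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, victim);
 
+	/*
+	 * The margin may have opened up while the (potentially very slow)
+	 * header dump was written to the console; re-check it before the
+	 * kill is committed and SIGKILL is sent via do_send_sig_info().
+	 */
+	if (is_memcg_oom(oc) &&
+	    mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
+		put_task_struct(victim);
+		return;
+	}
+
 	__oom_kill_process(victim, message);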
> > but I'm wondering why we do
> > it specifically for memcg oom.
> > Doesn't it apply to global oom as well?
> > For example, in a global oom, while the victim is being selected, other
> > tasks might free some pages, and the allocation might then succeed.
> >
>
> It's more complex because memory allocated from the page allocator must
> be physically contiguous; it's not a simple matter of comparing the
> margin of available memory to the memory being charged. It could
> theoretically be done, but I haven't seen a use case where it has
> actually mattered, as opposed to memcg oom, where it can happen quite
> readily at scale. When memory is uncharged from a memcg because of a
> large freeing or a process exiting, that memory is immediately
> chargeable by another process in the same hierarchy because of the
> memcg's isolation; in the page allocator, that memory is up for grabs
> and anything on the system could allocate it.
>
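
Right, for the global case "enough pages are free" is not the same as
"this allocation can now succeed". Just to illustrate the contiguity
point with a sketch (loosely modeled on the tail of __zone_watermark_ok();
this is not the actual kernel code), an order-N allocation also needs a
physically contiguous free block of order >= N in some zone, so a bare
page-count margin check cannot work:

/*
 * Illustrative sketch only: even if the total number of free pages
 * exceeds 1 << order, the allocation can only succeed if some zone
 * still holds a free block of order >= the requested order.
 */
static bool zone_has_contiguous_block(struct zone *zone, unsigned int order)
{
	unsigned int o;

	for (o = order; o < MAX_ORDER; o++) {
		if (zone->free_area[o].nr_free)
			return true;	/* a big-enough free block exists */
	}
	return false;
}
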
> > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > > generic mm code, but in the interest of preventing any unnecessary oom
> > > kill we've found it to be helpful.
> > >
> > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  void mem_cgroup_split_huge_fixup(struct page *head);
> > >  #endif
> > >  
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > > +
> > >  #else /* CONFIG_MEMCG */
> > >  
> > >  #define MEM_CGROUP_ID_SHIFT	0
> > > @@ -825,6 +827,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  {
> > >  }
> > >  
> > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > >  						  bool in_low_reclaim)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> > >   * Returns the maximum amount of memory @mem can be charged with, in
> > >   * pages.
> > >   */
> > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > >  	unsigned long margin = 0;
> > >  	unsigned long count;
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> > >  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> > >  			panic("System is deadlocked on memory\n");
> > >  	}
> > > -	if (oc->chosen && oc->chosen != (void *)-1UL)
> > > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +		if (is_memcg_oom(oc)) {
> > > +			/*
> > > +			 * If a memcg is now under its limit or current will be
> > > +			 * exiting and freeing memory, avoid needlessly killing
> > > +			 * chosen.
> > > +			 */
> > > +			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > > +			    task_will_free_mem(current)) {
> > > +				put_task_struct(oc->chosen);
> > > +				return true;
> > > +			}
> > > +		}
> > > +
> > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >  				 "Memory cgroup out of memory");
> > > +	}
> > >  	return !!oc->chosen;
> > >  }
> > >
> >
> >
> > --
> > Thanks
> > Yafang
> >
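
For reference, the mem_cgroup_margin() helper that both approaches rely
on looks roughly like this (abridged from mm/memcontrol.c of that era,
with the leading comment added by me; please check the exact tree):

/*
 * Returns how many pages @memcg can still be charged with: the smallest
 * (limit - usage) among the memory counter and, when memsw accounting
 * is enabled, the memory+swap counter.
 */
static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
{
	unsigned long margin = 0;
	unsigned long count;
	unsigned long limit;

	count = page_counter_read(&memcg->memory);
	limit = READ_ONCE(memcg->memory.max);
	if (count < limit)
		margin = limit - count;

	if (do_memsw_account()) {
		count = page_counter_read(&memcg->memsw);
		limit = READ_ONCE(memcg->memsw.max);
		if (count < limit)
			margin = min(margin, limit - count);
		else
			margin = 0;
	}

	return margin;
}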
--
Thanks
Yafang