From: David Rientjes <rientjes@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
stable@kernel.org, Michal Hocko <mhocko@suse.cz>,
azurit@pobox.sk, mm-commits@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
Date: Wed, 27 Nov 2013 16:56:04 -0800 (PST) [thread overview]
Message-ID: <alpine.DEB.2.02.1311271622330.10617@chino.kir.corp.google.com> (raw)
In-Reply-To: <20131127233353.GH3556@cmpxchg.org>
On Wed, 27 Nov 2013, Johannes Weiner wrote:
> > The memcg oom killer has incurred a serious regression since the 3.12-rc6
> > kernel where this was merged as 4942642080ea ("mm: memcg: handle non-error
> > OOM situations more gracefully"). It cc'd stable@kernel.org although it
> > doesn't appear to have been picked up yet, and I'm hoping that we can
> > avoid having it merged in a stable kernel until we get this fixed.
> >
> > This patch, specifically the above, allows memcgs to bypass their limits
> > by charging the root memcg in oom conditions.
> >
> > If I create a memcg, cg1, with memory.limit_in_bytes == 128MB and start a
> > memory allocator in it to induce oom, the memcg limit is trivially broken:
> >
> > membench invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> > membench cpuset=/ mems_allowed=0-3
> > CPU: 9 PID: 11388 Comm: membench Not tainted 3.13-rc1
> > ffffc90015ec6000 ffff880671c3dc18 ffffffff8154a1e3 0000000000000007
> > ffff880674c215d0 ffff880671c3dc98 ffffffff81548b45 ffff880671c3dc58
> > ffffffff81151de7 0000000000000001 0000000000000292 ffff880800000000
> > Call Trace:
> > [<ffffffff8154a1e3>] dump_stack+0x46/0x58
> > [<ffffffff81548b45>] dump_header+0x7a/0x1bb
> > [<ffffffff81151de7>] ? find_lock_task_mm+0x27/0x70
> > [<ffffffff812e8b76>] ? ___ratelimit+0x96/0x110
> > [<ffffffff811521c4>] oom_kill_process+0x1c4/0x330
> > [<ffffffff81099ee5>] ? has_ns_capability_noaudit+0x15/0x20
> > [<ffffffff811a916a>] mem_cgroup_oom_synchronize+0x50a/0x570
> > [<ffffffff811a8660>] ? __mem_cgroup_try_charge_swapin+0x70/0x70
> > [<ffffffff81152968>] pagefault_out_of_memory+0x18/0x90
> > [<ffffffff81547b86>] mm_fault_error+0xb7/0x15a
> > [<ffffffff81553efb>] __do_page_fault+0x42b/0x500
> > [<ffffffff810c667d>] ? set_next_entity+0xad/0xd0
> > [<ffffffff810c670b>] ? pick_next_task_fair+0x6b/0x170
> > [<ffffffff8154d08e>] ? __schedule+0x38e/0x780
> > [<ffffffff81553fde>] do_page_fault+0xe/0x10
> > [<ffffffff815509e2>] page_fault+0x22/0x30
> > Task in /cg1 killed as a result of limit of /cg1
> > memory: usage 131072kB, limit 131072kB, failcnt 1053
> > memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
> > kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> > Memory cgroup stats for /cg1: cache:84KB rss:130988KB rss_huge:116736KB mapped_file:72KB writeback:0KB inactive_anon:0KB active_anon:130976KB inactive_file:4KB active_file:0KB unevictable:0KB
> > [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> > [ 7761] 0 7761 1106 483 5 0 0 bash
> > [11388] 99 11388 270773 33031 83 0 0 membench
> > Memory cgroup out of memory: Kill process 11388 (membench) score 1010 or sacrifice child
> > Killed process 11388 (membench) total-vm:1083092kB, anon-rss:130824kB, file-rss:1300kB
> >
> > The score of 1010 shown for pid 11388 (membench) should never happen in
> > the oom killer, the maximum value should always be 1000 in any oom
> > context. This indicates that the process has allocated more memory than
> > is available to the memcg. The rss value, 33031 pages, shows that it has
> > allocated >129MB of memory in a memcg limited to 128MB.
> >
> > The entire premise of memcg is to prevent processes attached to it to not
> > be able to allocate more memory than allowed and this trivially breaks
> > that premise in oom conditions.
>
> We already allow a task to allocate beyond the limit if it's selected
> by the OOM killer, so that it can exit faster.
>
> My patch added that a task can bypass the limit when it decided to
> trigger the OOM killer, so that it can get to the OOM kill faster.
>
The task that is bypassing the memcg charge to the root memcg may not be
the process that is chosen by the oom killer, and it's possible the amount
of memory freed by killing the victim is less than the amount of memory
bypassed.
> So I don't think my patch has broken "the entire premise of memcgs".
> At the same time, it also does not really rely on that bypass, we
> should be able to remove it.
>
> This patch series was not supposed to go into the last merge window, I
> already told stable to hold off on these until further notice.
>
Were you targeting these to 3.13 instead? If so, it would have already
appeared in 3.13-rc1 anyway. Is it still a work in progress?
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 13b9d0f..5f9e467 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> goto bypass;
>
> if (unlikely(task_in_memcg_oom(current)))
> - goto bypass;
> + goto nomem;
>
> /*
> * We always charge the cgroup the mm_struct belongs to.
Is there any benefit to doing this over just schedule_timeout_killable()
since we need to wait for mem_cgroup_oom_synchronize() to be able to make
forward progress at this point?
Should we be checking mem_cgroup_margin() here to ensure
task_in_memcg_oom() is still accurate and we haven't raced by freeing
memory?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-11-28 0:56 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <526028bd.k5qPj2+MDOK1o6ii%akpm@linux-foundation.org>
2013-11-27 23:08 ` David Rientjes
2013-11-27 23:33 ` Johannes Weiner
2013-11-28 0:56 ` David Rientjes [this message]
2013-11-28 2:18 ` Johannes Weiner
2013-11-28 2:38 ` David Rientjes
2013-11-28 3:13 ` Johannes Weiner
2013-11-28 3:20 ` David Rientjes
2013-11-28 3:52 ` Johannes Weiner
2013-11-30 0:00 ` David Rientjes
2013-11-30 0:51 ` Greg KH
2013-11-30 10:25 ` David Rientjes
2013-11-30 3:35 ` Johannes Weiner
2013-11-30 10:32 ` David Rientjes
2013-11-30 15:55 ` Johannes Weiner
2013-11-30 22:12 ` David Rientjes
2013-11-28 10:02 ` Michal Hocko
2013-11-30 0:05 ` David Rientjes
2013-12-02 13:12 ` Michal Hocko
2013-12-02 22:51 ` David Rientjes
2013-11-28 9:12 ` Michal Hocko
2013-11-30 3:37 ` Johannes Weiner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.DEB.2.02.1311271622330.10617@chino.kir.corp.google.com \
--to=rientjes@google.com \
--cc=akpm@linux-foundation.org \
--cc=azurit@pobox.sk \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=mm-commits@vger.kernel.org \
--cc=stable@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox