From: David Rientjes <rientjes@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Minchan Kim <minchan.kim@gmail.com>,
Oleg Nesterov <oleg@redhat.com>,
linux-mm@kvack.org
Subject: Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing
Date: Mon, 19 Mar 2012 17:46:47 -0700 (PDT)
Message-ID: <alpine.DEB.2.00.1203191723470.3609@chino.kir.corp.google.com>
In-Reply-To: <20120319145245.7efb0cd4.akpm@linux-foundation.org>
On Mon, 19 Mar 2012, Andrew Morton wrote:
> > Yup, this is the one. We only currently see this when a memcg is at its
> > limit and there are other threads that are trying to exit that are blocked
> > on a coredumper that can no longer get memory. dump_write() calling
> > ->write() (ext4 in this case) causes a livelock when
> > add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache
> > to the coredumper's memcg that is oom and calls
> > mem_cgroup_charge_common(). That allows the oom, but the oom killer will
> > find the other threads that are exiting and choose to be a no-op to avoid
> > needlessly killing threads. The coredumper only has PF_DUMPCORE and not
> > PF_EXITING so it doesn't get immediately killed.
>
> I don't understand the description of the livelock. Does
> add_to_page_cache_locked() succeed, or fail? What does "allows the
> oom" mean?
>
Sorry if it wasn't clear. The coredumper, on calling into
add_to_page_cache_locked(), invokes the oom killer because its memcg is
oom (and would invoke the global oom killer if the entire system were
oom). The oom killer, both memcg and global, does nothing because it
sees eligible threads with PF_EXITING set. This logic has existed for
several years to avoid needlessly oom killing additional threads when
others are already in the process of exiting and freeing their memory.
Those PF_EXITING threads, however, are blocked in exit_mm() waiting on
the coredumper, so they will never actually exit. Thus, the coredumper
must make forward progress for anything to exit, and the oom killer is
useless.
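A minimal sketch of why the victim selection is a no-op here
(simplified from memory of the 3.3-era select_bad_process() in
mm/oom_kill.c, not the literal source):

	static struct task_struct *select_bad_process(void)
	{
		struct task_struct *p, *chosen = NULL;

		for_each_process(p) {
			/*
			 * An eligible thread is already exiting: assume
			 * its memory will be freed shortly and kill
			 * nothing.  With every exiting thread blocked in
			 * exit_mm() on the coredumper, this aborts the
			 * oom kill every single time.
			 */
			if ((p->flags & PF_EXITING) && p->mm)
				return ERR_PTR(-1UL);

			/* ... otherwise score p and remember the worst ... */
		}
		return chosen;
	}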
In this condition, there are a few options:
- give the coredumper access to memory reserves and allow it to allocate,
essentially oom killing it,
- fail coredumper memory allocations because of the oom condition and
allow the threads blocked on it to exit, or
- implement an oom killer timeout that would kill additional threads if
we repeatedly call into it without making forward progress over a small
period of time.
The first and last, in my opinion, are non-starters because they allow
a complete depletion of memory reserves: if the coredumper (or any
additional victim) is given access to reserves and they run dry, nothing
is guaranteed to ever be able to exit. This patch implements the middle
option: we make a best effort to allow the coredump to succeed (we even
try direct reclaim before failing) but choose to fail the allocation
before calling into the oom killer and causing a livelock.
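In sketch form (illustrative only, not the literal patch; the helper
name is hypothetical), the check sits in the allocator slowpath and the
memcg charge path after direct reclaim has failed, just before the oom
killer would otherwise be invoked:

	/*
	 * Illustrative sketch: fail a coredumping task's allocation
	 * rather than calling the oom killer, which would defer to the
	 * PF_EXITING threads that are themselves blocked on us.
	 */
	static bool should_fail_instead_of_oom(void)
	{
		return current->flags & PF_DUMPCORE;
	}

Failing here means ->write() sees -ENOMEM, the coredump is cut short,
and the threads waiting in exit_mm() can finally run.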
> AFAICT, dumping core should only require the allocation of 2-3
> unreclaimable pages at any one time. That's if reclaim is working
> properly. So I'd have thought that permitting the core-dumper to
> allocate those pages would cause everything to run to completion
> nicely.
>
If there's nothing to reclaim (more obvious when running in a memcg),
then the allocation cannot succeed and we livelock in the presence of
PF_EXITING threads that are waiting on the coredump; nothing in the
kernel currently allows those allocations to succeed. If we could
guarantee that the call to ->write() allocates at most 2-3 pages, then
we could perhaps get away with doing something like
	if (current->flags & PF_DUMPCORE) {
		/* treat the dumper like an oom victim: grant it reserves */
		set_thread_flag(TIF_MEMDIE);
		return 0;
	}
in the oom killer, like we allow for fatal_signal_pending() right now.
I chose to be more conservative, however, because the amount of memory
->write() allocates is filesystem dependent and may deplete all memory
reserves.
> Relatedly, RLIMIT_CORE shouldn't affect this? The core dumper only
> really needs to pin a single pagecache page: the one into which it is
> presently copying data.
>
It's filesystem dependent; without this patch the VM doesn't safeguard
against a livelock of either the memcg or the whole system. And even if
the dumper pins only a single pagecache page at a time, the
vulnerability still exists: that page must still be allocated and
charged.
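For concreteness, even a single new pagecache page goes through the
memcg charge before it is inserted (abridged from memory of the 3.3-era
mm/filemap.c; details may differ):

	int add_to_page_cache_locked(struct page *page,
				     struct address_space *mapping,
				     pgoff_t offset, gfp_t gfp_mask)
	{
		int error;

		/*
		 * With the dumper's memcg at its limit, this charge ends
		 * up calling the (no-op) memcg oom killer and retrying:
		 * the livelock described above.
		 */
		error = mem_cgroup_cache_charge(page, current->mm,
						gfp_mask & GFP_RECLAIM_MASK);
		if (error)
			return error;

		/* ... insert into the radix tree and the lru ... */
		return 0;
	}

So pinning a single page does not avoid the charge path; it only bounds
how much pagecache the dumper consumes once charges succeed.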