Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: mhocko@suse.com
Cc: linux-mm@kvack.org
Subject: Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
Date: Sat, 10 Dec 2016 20:24:57 +0900	[thread overview]
Message-ID: <201612102024.CBB26549.SJFOOtOVMFFQHL@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20161209144624.GB4334@dhcp22.suse.cz>

Michal Hocko wrote:
> On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > > If the OOM killer is invoked when many threads are looping inside the
> > > > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > > > threads.
> > > > > 
> > > > > Hmm, the only way I can see this would happen is when the task which
> > > > > actually manages to take the lock is not invoking the OOM killer for
> > > > > whatever reason. Is this what happens in your case? Are you able to
> > > > > trigger this reliably?
> > > > 
> > > > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > > somebody called oom_kill_process() and reached
> > > > 
> > > >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > > 
> > > > line but did not reach
> > > > 
> > > >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > 
> > > > line within tolerable delay.
> > > 
> > > I would be really interested in that. This can happen only if
> > > find_lock_task_mm fails. This would mean that either we are selecting a
> > > child without mm or the selected victim has no mm anymore. Both cases
> > > should be ephemeral because oom_badness will rule those tasks on the
> > > next round. So the primary question here is why no other task has hit
> > > out_of_memory.
> > 
> > This can also happen due to AB-BA livelock (oom_lock v.s. console_sem).
> 
> Care to explain how would that livelock look like?

Two types of threads (Thread-1 which is holding oom_lock, Thread-2 which is not
holding oom_lock) are doing memory allocation. Since oom_lock is a mutex, there
can be only 1 instance for Thread-1. But there can be multiple instances for
Thread-2.

(1) Thread-1 enters out_of_memory() because it is holding oom_lock.
(2) Thread-1 enters printk() due to

    pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);

    in oom_kill_process().

(3) vprintk_func() is mapped to vprintk_default() because Thread-1 is not
    inside NMI handler.

(4) In vprintk_emit(), in_sched == false because loglevel for pr_err()
    is not LOGLEVEL_SCHED.

(5) Thread-1 calls log_store() via log_output() from vprintk_emit().

(6) Thread-1 calls console_trylock() because in_sched == false.

(7) Thread-1 acquires console_sem via down_trylock_console_sem().

(8) In console_trylock(), console_may_schedule is set to true because
    Thread-1 is in sleepable context.

(9) Thread-1 calls console_unlock() because console_trylock() succeeded.

(9) In console_unlock(), pending data stored by log_store() are printed
    to consoles. Since there may be slow consoles, cond_resched() is called
    if possible. And since console_may_schedule == true because Thread-1 is
    in sleepable context, Thread-1 may be scheduled at console_unlock().

(10) Thread-2 tries to acquire oom_lock but it fails because Thread-1 is
     holding oom_lock.

(11) Thread-2 enters warn_alloc() because it is waiting for Thread-1 to
     return from oom_kill_process().

(12) Thread-2 enters printk() due to

     warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

     in __alloc_pages_slowpath().

(13) vprintk_func() is mapped to vprintk_default() because Thread-2 is not
     inside NMI handler.

(14) In vprintk_emit(), in_sched == false because loglevel for pr_err()
     is not LOGLEVEL_SCHED.

(15) Thread-2 calls log_store() via log_output() from vprintk_emit().

(16) Thread-2 calls console_trylock() because in_sched == false.

(17) Thread-2 fails to acquire console_sem via down_trylock_console_sem().

(18) Thread-2 returns from vprintk_emit().

(19) Thread-2 leaves warn_alloc().

(20) When Thread-1 is waken up, it finds new data appended by Thread-2.

(21) Thread-1 remains inside console_unlock() with oom_lock still held
     because there is data which should be printed to consoles.

(22) Thread-2 remains failing to acquire oom_lock, periodically appending
     new data via warn_alloc(), and failing to acquire oom_lock.

(23) The user visible result is that Thread-1 is unable to return from

     pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);

     in oom_kill_process().

The introduction of uncontrolled

  warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

in __alloc_pages_slowpath() increased the possibility for Thread-1 to remain
inside console_unlock(). Although Sergey is working on this problem by
offloading printing to consoles, we might still see "** XXX printk messages
dropped **" messages if we let Thread-2 call printk() uncontrolledly, for

  /*
   * Give the killed process a good chance to exit before trying
   * to allocate memory again.
   */
  schedule_timeout_killable(1);

which is called after Thread-1 returned from oom_kill_process() allows
Thread-2 and other threads to consume long duration before the OOM reaper
can start reaping by taking oom_lock.

> 
> > >                Have you tried to instrument the kernel and see whether
> > > GFP_NOFS contexts simply preempted any other attempt to get there?
> > > I would find it quite unlikely but not impossible. If that is the case
> > > we should really think how to move forward. One way is to make the oom
> > > path fully synchronous as suggested below. Other is to tweak GFP_NOFS
> > > some more and do not take the lock while we are evaluating that. This
> > > sounds quite messy though.
> > 
> > Do you mean "tweak GFP_NOFS" as something like below patch?
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3036,6 +3036,17 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
> >  
> >  	*did_some_progress = 0;
> >  
> > +	if (!(gfp_mask & (__GFP_FS | __GFP_NOFAIL))) {
> > +		if ((current->flags & PF_DUMPCORE) ||
> > +		    (order > PAGE_ALLOC_COSTLY_ORDER) ||
> > +		    (ac->high_zoneidx < ZONE_NORMAL) ||
> > +		    (pm_suspended_storage()) ||
> > +		    (gfp_mask & __GFP_THISNODE))
> > +			return NULL;
> > +		*did_some_progress = 1;
> > +		return NULL;
> > +	}
> > +
> >  	/*
> >  	 * Acquire the oom lock.  If that fails, somebody else is
> >  	 * making progress for us.
> > 
> > Then, serial-20161209-gfp.txt in http://I-love.SAKURA.ne.jp/tmp/20161209.tar.xz is
> > console log with above patch applied. Spinning without invoking the OOM killer.
> > It did not avoid locking up.
> 
> OK, so the reason of the lock up must be something different. If we are
> really {dead,live}locking on the printk because of warn_alloc then that
> path should be tweaked instead. Something like below should rule this
> out:

Last year I proposed disabling preemption at
http://lkml.kernel.org/r/201509191605.CAF13520.QVSFHLtFJOMOOF@I-love.SAKURA.ne.jp
but it was not accepted. "while (1);" in userspace corresponds with
pointless "direct reclaim and warn_alloc()" in kernel space. This time,
I'm proposing serialization by oom_lock and replace warn_alloc() with kmallocwd
in order to make printk() not to flood.

> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed65d7df72d5..c2ba51cec93d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3024,11 +3024,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	unsigned int filter = SHOW_MEM_FILTER_NODES;
>  	struct va_format vaf;
>  	va_list args;
> +	static DEFINE_MUTEX(warn_lock);
>  
>  	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
>  	    debug_guardpage_minorder() > 0)
>  		return;
>  

if (gfp_mask & __GFP_DIRECT_RECLAIM)

> +	mutex_lock(&warn_lock);
> +
>  	/*
>  	 * This documents exceptions given to allocations in certain
>  	 * contexts that are allowed to allocate outside current's set
> @@ -3054,6 +3057,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  	dump_stack();
>  	if (!should_suppress_show_mem())
>  		show_mem(filter);
> +

if (gfp_mask & __GFP_DIRECT_RECLAIM)

> +	mutex_unlock(&warn_lock);
>  }
>  
>  static inline struct page *

and I think "s/warn_lock/oom_lock/" because out_of_memory() might
call show_mem() concurrently.

I think this warn_alloc() is too much noise. When something went
wrong, multiple instances of Thread-2 tend to call warn_alloc()
concurrently. We don't need to report similar memory information.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-12-10 11:25 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-06 10:33 Tetsuo Handa
2016-12-07  8:15 ` Michal Hocko
2016-12-07 15:29   ` Tetsuo Handa
2016-12-08  8:20     ` Vlastimil Babka
2016-12-08 11:00       ` Tetsuo Handa
2016-12-08 13:32         ` Michal Hocko
2016-12-08 16:18         ` Sergey Senozhatsky
2016-12-08 13:27     ` Michal Hocko
2016-12-09 14:23       ` Tetsuo Handa
2016-12-09 14:46         ` Michal Hocko
2016-12-10 11:24           ` Tetsuo Handa [this message]
2016-12-12  9:07             ` Michal Hocko
2016-12-12 11:49               ` Petr Mladek
2016-12-12 13:00                 ` Michal Hocko
2016-12-12 14:05                   ` Tetsuo Handa
2016-12-13  1:06                 ` Sergey Senozhatsky
2016-12-12 12:12               ` Tetsuo Handa
2016-12-12 12:55                 ` Michal Hocko
2016-12-12 13:19                   ` Michal Hocko
2016-12-13 12:06                     ` Tetsuo Handa
2016-12-13 17:06                       ` Michal Hocko
2016-12-14 11:37                         ` Tetsuo Handa
2016-12-14 12:42                           ` Michal Hocko
2016-12-14 16:36                             ` Tetsuo Handa
2016-12-14 18:18                               ` Michal Hocko
2016-12-15 10:21                                 ` Tetsuo Handa
2016-12-19 11:25                                   ` Tetsuo Handa
2016-12-19 12:27                                     ` Sergey Senozhatsky
2016-12-20 15:39                                       ` Sergey Senozhatsky
2016-12-22 10:27                                         ` Tetsuo Handa
2016-12-22 10:53                                           ` Petr Mladek
2016-12-22 13:40                                             ` Sergey Senozhatsky
2016-12-22 13:33                                           ` Tetsuo Handa
2016-12-22 19:24                                             ` Michal Hocko
2016-12-24  6:25                                               ` Tetsuo Handa
2016-12-26 11:49                                                 ` Michal Hocko
2016-12-27 10:39                                                   ` Tetsuo Handa
2016-12-27 10:57                                                     ` Michal Hocko
2016-12-22 13:42                                           ` Sergey Senozhatsky
2016-12-22 14:01                                             ` Tetsuo Handa
2016-12-22 14:09                                               ` Sergey Senozhatsky
2016-12-22 14:30                                                 ` Sergey Senozhatsky
2016-12-26 10:54                                                 ` Tetsuo Handa
2016-12-26 11:34                                                   ` Sergey Senozhatsky
2017-01-12 13:10                                                     ` Petr Mladek
2017-01-13  2:52                                                       ` Sergey Senozhatsky
2017-01-13  3:53                                                         ` Sergey Senozhatsky
2017-01-13 11:15                                                           ` Petr Mladek
2017-01-13 11:14                                                         ` Petr Mladek
2017-01-12 14:18                                                     ` Petr Mladek
2017-01-13  2:28                                                       ` Sergey Senozhatsky
2017-01-13 11:03                                                         ` Petr Mladek
2017-01-13 11:50                                                           ` Sergey Senozhatsky
2017-01-13 12:15                                                             ` Petr Mladek
2016-12-26 11:41                                                   ` Sergey Senozhatsky
2017-01-13 14:03                                                     ` Petr Mladek
2016-12-15  1:11                         ` Sergey Senozhatsky
2016-12-15  6:35                           ` Michal Hocko
2016-12-15 10:16                             ` Petr Mladek
2016-12-14  9:37                       ` Petr Mladek
2016-12-14 10:20                         ` Sergey Senozhatsky
2016-12-14 11:01                           ` Petr Mladek
2016-12-14 12:23                             ` Sergey Senozhatsky
2016-12-14 12:47                               ` Petr Mladek
2016-12-14 10:26                         ` Michal Hocko
2016-12-15  7:34                           ` Sergey Senozhatsky
2016-12-14 11:37                         ` Tetsuo Handa
2016-12-14 12:36                           ` Petr Mladek
2016-12-14 12:44                             ` Michal Hocko
2016-12-14 13:36                               ` Tetsuo Handa
2016-12-14 13:52                                 ` Michal Hocko
2016-12-14 12:50                             ` Sergey Senozhatsky
2016-12-12 14:59                   ` Tetsuo Handa
2016-12-12 15:55                     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201612102024.CBB26549.SJFOOtOVMFFQHL@I-love.SAKURA.ne.jp \
    --to=penguin-kernel@i-love.sakura.ne.jp \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox