* Re: Silent hang up caused by pages being not scanned?
@ 2015-10-14 8:03 Hillf Danton
0 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2015-10-14 8:03 UTC (permalink / raw)
To: 'Tetsuo Handa'
Cc: Linus Torvalds, Michal Hocko, David Rientjes, Johannes Weiner,
linux-kernel, linux-mm
> >
> > In particular, I think that you'll find that you will have to change
> > the heuristics in __alloc_pages_slowpath() where we currently do
> >
> > if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
> >
> > when the "did_some_progress" logic changes that radically.
> >
>
> Yes. But we can't simply do
>
> if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?
>
Can you please try a simplified retry logic?
thanks
Hillf
--- a/mm/page_alloc.c	Wed Oct 14 14:45:28 2015
+++ b/mm/page_alloc.c	Wed Oct 14 15:43:31 2015
@@ -3154,8 +3154,7 @@ retry:
 	/* Keep reclaiming pages as long as there is reasonable progress */
 	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
+	if (did_some_progress) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: can't oom-kill zap the victim's memory?
@ 2015-09-28 16:18 Tetsuo Handa
2015-10-02 12:36 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-09-28 16:18 UTC (permalink / raw)
To: mhocko, rientjes
Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
linux-kernel, skozina

Michal Hocko wrote:
> The point I've tried to made is that oom unmapper running in a detached
> context (e.g. kernel thread) vs. directly in the oom context doesn't
> make any difference wrt. lock because the holders of the lock would loop
> inside the allocator anyway because we do not fail small allocations.

We tried to allow small allocations to fail. It resulted in an unstable
system with obscure bugs. We tried to allow small !__GFP_FS allocations to
fail. They failed to fail because they are effectively __GFP_NOFAIL
allocations. We are now trying to allow zapping the OOM victim's mm.
Michal is already skeptical about this approach due to lock dependency.

We have already spent 9 months on this OOM livelock. There is no silver
bullet yet. The proposed approaches are too drastic to backport for
existing users. I think we are out of bullets. Until we finish
adding/testing __GFP_NORETRY (or __GFP_KILLABLE) at most of the callsites,
a timeout-based workaround will be the only bullet we can use. Michal's
panic_on_oom_timeout and David's "global access to memory reserves" will be
acceptable for some users if these approaches are offered as opt-in.
Likewise, my memdie_task_skip_secs / memdie_task_panic_secs will be
acceptable for those who would rather retry a bit longer than panic on an
accidental livelock, if this approach is offered as opt-in.

Tetsuo Handa wrote:
> Excuse me, but thinking about the CLONE_VM without CLONE_THREAD case...
> Isn't there a possibility of hitting livelocks at
>
>	/*
>	 * If current has a pending SIGKILL or is exiting, then automatically
>	 * select it.  The goal is to allow it to allocate so that it may
>	 * quickly exit and free its memory.
>	 *
>	 * But don't select if current has already released its mm and cleared
>	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>	 */
>	if (current->mm &&
>	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
>		mark_oom_victim(current);
>		return true;
>	}
>
> if the current thread receives SIGKILL just before reaching here, since we
> don't send SIGKILL to all threads sharing the mm?

It seems that CLONE_VM without CLONE_THREAD is irrelevant here. We have
sequences like

  Do a GFP_KERNEL allocation.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release the lock.

where an example is seen in VFS operations which receive a pathname from
user space using getname() and then call VFS functions, after which
filesystem code takes locks which can contend with other threads.

------------------------------------------------------------
diff --git a/fs/namei.c b/fs/namei.c
index d68c21f..d51c333 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4005,6 +4005,8 @@ int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 	if (error)
 		return error;
 
+	if (fatal_signal_pending(current))
+		printk(KERN_INFO "Calling symlink with SIGKILL pending\n");
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	if (!error)
 		fsnotify_create(dir, dentry);
@@ -4021,6 +4023,10 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	struct path path;
 	unsigned int lookup_flags = 0;
 
+	if (!strcmp(current->comm, "a.out")) {
+		printk(KERN_INFO "Sending SIGKILL to current thread\n");
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+	}
 	from = getname(oldname);
 	if (IS_ERR(from))
 		return PTR_ERR(from);
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 996481e..2b6faa5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -240,6 +240,8 @@ xfs_symlink(
 	if (error)
 		goto out_trans_cancel;
 
+	if (fatal_signal_pending(current))
+		printk(KERN_INFO "Calling xfs_ilock() with SIGKILL pending\n");
 	xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
 		  XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
------------------------------------------------------------

[ 119.534976] Sending SIGKILL to current thread
[ 119.535898] Calling symlink with SIGKILL pending
[ 119.536870] Calling xfs_ilock() with SIGKILL pending

Any program can potentially hit this silent livelock. We can't predict
which locks the OOM victim's threads will depend on after TIF_MEMDIE was
set by the OOM killer. Therefore, I think that letting TIF_MEMDIE disable
the OOM killer indefinitely is one of the possible causes of the silent
hangup troubles.

Michal Hocko wrote:
> I really hate to do "easy" things now just to feel better about
> particular case which will kick us back little bit later. And from my
> own experience I can tell you that a more non-deterministic OOM behavior
> is thing people complain about.

I believe that not waiting for the TIF_MEMDIE thread indefinitely is the
first choice we can propose people to try. From my own experience I can
tell you that some customers are really sensitive about bugs which halt
their systems (e.g. https://access.redhat.com/solutions/68466 ). An opt-in
version of a TIF_MEMDIE timeout should be acceptable for people who prefer
avoiding a silent hangup over non-deterministic OOM behavior, once they are
told the truth about the current memory allocator's behavior.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: can't oom-kill zap the victim's memory?
2015-09-28 16:18 can't oom-kill zap the victim's memory? Tetsuo Handa
@ 2015-10-02 12:36 ` Michal Hocko
2015-10-03 6:02 ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-02 12:36 UTC (permalink / raw)
To: Tetsuo Handa
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The point I've tried to made is that oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. lock because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
>
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.

Have they been reported/fixed? All kernel paths doing an allocation are
_supposed_ to check and handle ENOMEM. If they are not then they are
buggy and should be fixed.

> We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> effectively __GFP_NOFAIL allocations.

What do you mean by that? An opencoded __GFP_NOFAIL?

> We are now trying to allow zapping OOM victim's mm. Michal is already
> skeptical about this approach due to lock dependency.

I am not sure where this came from. I am all for this approach. It will
not solve the problem completely for sure but it can help in many cases
already.

> We already spent 9 months on this OOM livelock. No silver bullet yet.
> Proposed approaches are too drastic to backport for existing users.
> I think we are out of bullet.

Not at all. We have had this problem basically forever, and we have a lot
of legacy issues to care about. But nobody could reasonably expect this
to be solved in a short time period.

> Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> of callsites,

This is simply not doable. There are thousands of allocation sites all
over the kernel.

> timeout based workaround will be the only bullet we can use.

Those are the last resort which only papers over real bugs which should
be fixed. I would agree with your urging if this were something that
could easily happen on a _properly_ configured system. A system which can
blow up into an OOM storm is far from being configured properly. If you
have untrusted users running on your system, you should put them into a
highly restricted environment and limit them as much as possible.

I can completely understand your frustration about the pace of the
progress here, but this is nothing new, and we should strive for a long
term vision which would be much less fragile than what we have right now.
No timeout-based solution is the way in that direction.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 18+ messages in thread
* Can't we use timeout based OOM warning/killing?
2015-10-02 12:36 ` Michal Hocko
@ 2015-10-03 6:02 ` Tetsuo Handa
2015-10-06 14:51 ` Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-03 6:02 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > The point I've tried to made is that oom unmapper running in a detached
> > > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > > make any difference wrt. lock because the holders of the lock would loop
> > > inside the allocator anyway because we do not fail small allocations.
> >
> > We tried to allow small allocations to fail. It resulted in unstable system
> > with obscure bugs.
>
> Have they been reported/fixed? All kernel paths doing an allocation are
> _supposed_ to check and handle ENOMEM. If they are not then they are
> buggy and should be fixed.

Kernel developers are not interested in testing OOM cases. I proposed a
SystemTap-based mandatory memory allocation failure injection for testing
OOM cases, but there was no response. Most of the memory allocation
failure paths in the kernel remain untested. Unless you persuade all
kernel developers to test OOM cases, add a gfp flag which bypasses the
memory allocation failure injection test (e.g. __GFP_FITv1_PASSED), and
make any !__GFP_FITv1_PASSED && !__GFP_NOFAIL allocation always fail, we
can't verify that "all kernel paths doing an allocation are _supposed_ to
check and handle ENOMEM".

> > We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> > effectively __GFP_NOFAIL allocations.
>
> What do you mean by that? An opencoded __GFP_NOFAIL?

Yes. An XFS livelock is an example I can trivially reproduce. Loss of
reliability of buffered write()s is another example.
[ 1721.405074] buffer_io_error: 36 callbacks suppressed
[ 1721.406263] Buffer I/O error on dev sda1, logical block 34652401, lost async page write
[ 1721.406996] Buffer I/O error on dev sda1, logical block 34650278, lost async page write
[ 1721.407125] Buffer I/O error on dev sda1, logical block 34652330, lost async page write
[ 1721.407197] Buffer I/O error on dev sda1, logical block 34653485, lost async page write
[ 1721.407203] Buffer I/O error on dev sda1, logical block 34652398, lost async page write
[ 1721.407232] Buffer I/O error on dev sda1, logical block 34650494, lost async page write
[ 1721.407356] Buffer I/O error on dev sda1, logical block 34652361, lost async page write
[ 1721.407386] Buffer I/O error on dev sda1, logical block 34653484, lost async page write
[ 1721.407481] Buffer I/O error on dev sda1, logical block 34652396, lost async page write
[ 1721.407504] Buffer I/O error on dev sda1, logical block 34650291, lost async page write
[ 1723.369963] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1723.810033] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1725.434057] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.448049] XFS: a.out(7810) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.470757] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.474061] XFS: a.out(7881) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.586610] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1726.026702] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1726.043988] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1727.682001] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1727.688661] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1727.785214] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1728.226640] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1728.290648] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1729.930028] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)

> > We are now trying to allow zapping OOM victim's mm. Michal is already
> > skeptical about this approach due to lock dependency.
>
> I am not sure where this came from. I am all for this approach. It will
> not solve the problem completely for sure but it can help in many cases
> already.

Sorry. This was my misunderstanding. But I still think that we need to be
prepared for cases where the approach of zapping the OOM victim's mm
fails.
( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

> > We already spent 9 months on this OOM livelock. No silver bullet yet.
> > Proposed approaches are too drastic to backport for existing users.
> > I think we are out of bullet.
>
> Not at all. We have this problem since ever basically. And we have a lot
> of legacy issues to care about. But nobody could reasonably expect this
> will be solved in a short time period.

What people generally imagine about the OOM killer is that it is invoked
when the system is out of memory. But we know that there are many possible
cases where the OOM killer messages are not printed. We did not make an
effort to break people free from the belief that the OOM killer is invoked
whenever the system is out of memory, nor did we make an effort to provide
people a means to warn about an OOM situation, after we recognized the
"too small to fail" memory-allocation rule
( https://lwn.net/Articles/627419/ ) 9 months ago.
> > Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> > of callsites,
>
> This is simply not doable. There are thousand of allocation sites all
> over the kernel.

But changing the default behavior (i.e. implicitly behaving like
__GFP_NORETRY inside the memory allocator unless __GFP_NOFAIL is passed)
is also not doable. You would need to ask for ACKs from thousands of
allocation sites all over the kernel, and that is not realistic.

An example. I proposed a patch which changes the default behavior in XFS
and got feedback ( http://marc.info/?l=linux-mm&m=144279862227010 ) that
fundamentally changing the allocation behavior of the filesystem requires
some indication of the testing and characterization of how the change has
impacted the low memory balance and performance of the filesystem. You
would need to ask for ACKs from all filesystem developers.

Another example. I don't like that permission checks for access requests
from user space start failing with ENOMEM when memory is tight. It is not
acceptable that access requests by critical processes are failed because
of an inconsequential process's memory consumption.
( https://www.mail-archive.com/tomoyo-users-en@lists.osdn.me/msg00008.html )
This problem is not limited to permission checks. If a process executes a
program using execve() and that process has passed the point of no return
in the execve() operation, any memory allocation failure before reaching
the point of handling ENOMEM errors (e.g. failing to load shared libraries
before calling the main() function of the new program) gets the process
killed. If the process were the global init process, the system would
panic(). Although we mean to simply enforce "all kernel paths doing an
allocation are _supposed_ to check and handle ENOMEM", we end up with a
period where a memory allocation failure in user space results in an
unrecoverable failure. We depend on /proc/$pid/oom_score_adj for
protecting critical processes from inconsequential processes.

I'm happy to give up a memory allocation upon SIGKILL, but I'm not happy
to give up upon ENOMEM without making an effort to solve the OOM
situation.

> > timeout based workaround will be the only bullet we can use.
>
> Those are the last resort which only paper over real bugs which should
> be fixed. I would agree with your urging if this was something that can
> easily happen on a _properly_ configured system. System which can blow
> into an OOM storm is far from being configured properly. If you have an
> untrusted users running on your system you should better put them into a
> highly restricted environment and limit as much as possible.

People are reporting hangup problems. I suspect that some of them are
caused by silent OOM. I showed you that there are many possible paths
which can lead to a silent hangup. But we are forcing people to use
kernels without a means to find out what was happening. Therefore, "there
is no report" does not mean that "we are not hitting OOM livelock
problems". Without a means to find out what was happening, we will
"overlook real bugs" before we "paper over real bugs". Such a means is
expected to work without knowledge of the trace points functionality, to
run without memory allocation, to dump output without an administrator's
operation, and to work before a power reset by watchdog timers.

> I can completely understand your frustration about the pace of the
> progress here but this is nothing new and we should strive for long term
> vision which would be much less fragile than what we have right now. No
> timeout based solution is the way in that direction.

Can we stop randomly setting TIF_MEMDIE on only one task and staying
silent forever in the hope that the task can make a quick exit? As long
as small allocations do not fail, this TIF_MEMDIE logic is prone to
livelock. We won't be able to change small allocations to fail (as Linus
said at
http://lkml.kernel.org/r/CA+55aFw=OLSdh-5Ut2vjy=4Yf1fTXqpzoDHdF7XnT5gDHs6sYA@mail.gmail.com
and as I said in this post) in the near future. Like I said at
http://lkml.kernel.org/r/201510012113.HEA98301.SVFQOFtFOHLMOJ@I-love.SAKURA.ne.jp ,
can't we start adding a means to emit some diagnostic kernel messages
automatically?

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
2015-10-03 6:02 ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
@ 2015-10-06 14:51 ` Tetsuo Handa
2015-10-12 6:43 ` Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-06 14:51 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Sorry. This was my misunderstanding. But I still think that we need to be
> prepared for cases where zapping OOM victim's mm approach fails.
> ( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

I tested whether it is easy or difficult to make the approach of zapping
the OOM victim's mm fail. It turns out to be not difficult to make it
fail.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int reader(void *unused)
{
	char c;
	int fd = open("/proc/self/cmdline", O_RDONLY);
	while (pread(fd, &c, 1, 0) == 1);
	return 0;
}

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static void *ptr[10000];
	int i;
	sleep(2);
	while (1) {
		for (i = 0; i < 10000; i++)
			ptr[i] = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE,
				      fd, 0);
		for (i = 0; i < 10000; i++)
			munmap(ptr[i], 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int zero_fd = open("/dev/zero", O_RDONLY);
	char *buf = NULL;
	unsigned long size = 0;
	int i;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < 100; i++) {
		clone(reader, malloc(1024) + 1024,
		      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	}
	clone(writer, malloc(1024) + 1024,
	      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return * (char *) NULL; /* Kill all threads. */
}
---------- Reproducer end ----------

(I wrote this program to try to mimic a problem where a customer's system
hung up with a lot of ps processes blocked at reading /proc/pid/ entries
due to an unkillable down_read(&mm->mmap_sem) in __access_remote_vm().
Though I couldn't identify what function was holding the mmap_sem for
writing...)

Uptime > 429 of http://I-love.SAKURA.ne.jp/tmp/serial-20151006.txt.xz
showed an OOM livelock in which

  (1) the thread group leader is blocked at down_read(&mm->mmap_sem) in
      exit_mm() called from do_exit().

  (2) the writer thread is blocked at down_write(&mm->mmap_sem) in
      vm_mmap_pgoff() called from SyS_mmap_pgoff() called from SyS_mmap().

  (3) many reader threads are blocking the writer thread because of
      down_read(&mm->mmap_sem) called from proc_pid_cmdline_read().

  (4) while the thread group leader is blocked at down_read(&mm->mmap_sem),
      some of the reader threads are trying to allocate memory via page
      fault.

So, zapping the first OOM victim's mm might fail by chance.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
2015-10-06 14:51 ` Tetsuo Handa
@ 2015-10-12 6:43 ` Tetsuo Handa
2015-10-12 15:25 ` Silent hang up caused by pages being not scanned? Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-12 6:43 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> So, zapping the first OOM victim's mm might fail by chance.

I retested with a slightly different version.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	char buffer[128] = { };
	const pid_t pid = fork();
	if (pid == 0) {
		/* down_write(&mm->mmap_sem) requester which is chosen as
		 * an OOM victim. */
		int i;
		for (i = 0; i < 9; i++)
			clone(writer, malloc(1024) + 1024,
			      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
		writer(NULL);
	}
	snprintf(buffer, sizeof(buffer) - 1, "/proc/%u/stat", pid);
	if (fork() == 0) {
		/* down_read(&mm->mmap_sem) requester. */
		const int fd = open(buffer, O_RDONLY);
		while (pread(fd, buffer, sizeof(buffer), 0) > 0);
		_exit(0);
	} else {
		/* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long size = 0;
		const int fd = open("/dev/zero", O_RDONLY);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		read(fd, buf, size); /* Will cause OOM due to overcommit */
		return 0;
	}
}
---------- Reproducer end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151012.txt.xz .
Uptime between 101 and 300 is a silent hangup (i.e. no OOM killer
messages, no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I resolved
using SysRq-f at uptime = 289. I don't know the reason for this silent
hangup, but the memory unmapping kernel thread will not help because there
is no OOM victim.

----------
[ 101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 289.343187] sysrq: SysRq : Manual OOM execution
(...snipped...)
[ 292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
(...snipped...)
[ 302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
----------

Uptime between 379 and 605 is a mmap_sem livelock after the OOM killer was
invoked.
----------
[ 380.039897] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 380.042500] [ 467] 0 467 14047 1815 28 3 0 0 systemd-journal
[ 380.045055] [ 482] 0 482 10413 259 23 3 0 -1000 systemd-udevd
[ 380.047637] [ 504] 0 504 12795 119 25 3 0 -1000 auditd
[ 380.050127] [ 1244] 0 1244 82428 4257 81 3 0 0 firewalld
[ 380.052536] [ 1247] 70 1247 6988 61 21 3 0 0 avahi-daemon
[ 380.055028] [ 1250] 0 1250 54104 1372 42 4 0 0 rsyslogd
[ 380.057505] [ 1251] 0 1251 137547 2620 91 3 0 0 tuned
[ 380.059996] [ 1255] 0 1255 4823 77 15 3 0 0 irqbalance
[ 380.062552] [ 1256] 0 1256 1095 37 8 3 0 0 rngd
[ 380.065020] [ 1259] 0 1259 53626 441 60 3 0 0 abrtd
[ 380.067383] [ 1260] 0 1260 53001 341 58 5 0 0 abrt-watch-log
[ 380.069965] [ 1265] 0 1265 8673 83 21 3 0 0 systemd-logind
[ 380.072554] [ 1266] 81 1266 6663 117 18 3 0 -900 dbus-daemon
[ 380.075122] [ 1272] 0 1272 31577 154 21 3 0 0 crond
[ 380.077544] [ 1314] 70 1314 6988 57 19 3 0 0 avahi-daemon
[ 380.080013] [ 1427] 0 1427 46741 225 44 3 0 0 vmtoolsd
[ 380.082478] [ 1969] 0 1969 25942 3100 48 3 0 0 dhclient
[ 380.084969] [ 1990] 999 1990 128626 1929 50 4 0 0 polkitd
[ 380.087516] [ 2073] 0 2073 20629 214 45 3 0 -1000 sshd
[ 380.090065] [ 2201] 0 2201 7320 68 21 3 0 0 xinetd
[ 380.092465] [ 3215] 0 3215 22773 257 44 3 0 0 master
[ 380.094879] [ 3217] 89 3217 22816 249 45 3 0 0 qmgr
[ 380.097304] [ 3249] 0 3249 75245 315 97 3 0 0 nmbd
[ 380.099666] [ 3259] 0 3259 92963 486 131 5 0 0 smbd
[ 380.101956] [ 3282] 0 3282 27503 30 12 3 0 0 agetty
[ 380.104277] [ 3283] 0 3283 21788 154 49 3 0 0 login
[ 380.106574] [ 3286] 0 3286 92963 486 126 5 0 0 smbd
[ 380.108835] [ 3296] 1000 3296 28864 117 13 3 0 0 bash
[ 380.111073] [ 3374] 89 3374 22799 249 46 3 0 0 pickup
[ 380.113298] [ 3378] 89 3378 22836 252 45 3 0 0 cleanup
[ 380.115555] [ 3385] 89 3385 22800 248 44 3 0 0 trivial-rewrite
[ 380.117811] [ 3392] 0 3392 22825 265 48 3 0 0 local
[ 380.119995] [ 3393] 0 3393 30828 59 17 3 0 0 anacron
[ 380.122183] [ 3417] 1000 3417 541715 397587 787 6 0 0 a.out
[ 380.124315] [ 3418] 1000 3418 1081 24 8 3 0 0 a.out
[ 380.126410] [ 3419] 1000 3419 1042 21 7 3 0 0 a.out
[ 380.128535] Out of memory: Kill process 3417 (a.out) score 890 or sacrifice child
[ 380.130392] Killed process 3418 (a.out) total-vm:4324kB, anon-rss:96kB, file-rss:0kB
[ 392.704028] MemAlloc-Info: 7 stalling task, 10 dying task, 1 victim task.
(...snipped...)
[ 601.129977] a.out R running task 0 3417 3296 0x00000080
[ 601.131899] ffff8800774dba10 ffffffff8112b174 0000000000000100 0000000000000000
[ 601.134026] 0000000000000000 0000000000000000 00000000a23cb49d 0000000000000000
[ 601.136076] ffff880077603200 00000000024280ca 0000000000000000 ffff880077603200
[ 601.138090] Call Trace:
[ 601.139145] [<ffffffff8112b174>] ? try_to_free_pages+0x94/0xc0
[ 601.140831] [<ffffffff8111a8c4>] ? out_of_memory+0x2f4/0x460
[ 601.142489] [<ffffffff8111fa63>] ? __alloc_pages_nodemask+0x613/0xc30
[ 601.144328] [<ffffffff81161c40>] ? alloc_pages_vma+0xb0/0x200
[ 601.145994] [<ffffffff81143056>] ? handle_mm_fault+0xfa6/0x1370
[ 601.147677] [<ffffffff8162f557>] ? native_iret+0x7/0x7
[ 601.149258] [<ffffffff81058217>] ? __do_page_fault+0x177/0x400
[ 601.150966] [<ffffffff810584d0>] ? do_page_fault+0x30/0x80
[ 601.152625] [<ffffffff81630518>] ? page_fault+0x28/0x30
[ 601.154159] [<ffffffff813230c0>] ? __clear_user+0x20/0x50
[ 601.155723] [<ffffffff81327a68>] ? iov_iter_zero+0x68/0x250
[ 601.157329] [<ffffffff813fc6c8>] ? read_iter_zero+0x38/0xa0
[ 601.158923] [<ffffffff81187f04>] ? __vfs_read+0xc4/0xf0
[ 601.160453] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120
[ 601.161961] [<ffffffff811893a0>] ? SyS_read+0x50/0xc0
[ 601.163513] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.165254] a.out D ffff8800777b7e08 0 3418 3417 0x00100084
[ 601.167118] ffff8800777b7e08 ffff880077606400 ffff8800777b8000 ffff880036032e00
[ 601.169137] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff8800777b7e20
[ 601.171159] ffffffff8162a570 ffff880077606400 ffff8800777b7ea8 ffffffff8162d8eb
[ 601.173183] Call Trace:
[ 601.174193] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.175661] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[ 601.177388] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.179194] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[ 601.180971] [<ffffffff8162d05f>] ? down_write+0x1f/0x30
[ 601.182509] [<ffffffff81147abe>] vm_munmap+0x2e/0x60
[ 601.183992] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30
[ 601.185485] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.187224] a.out D ffff88007c60fdf0 0 3420 3417 0x00000084
[ 601.189130] ffff88007c60fdf0 ffff880078e15780 ffff88007c610000 ffff880036032de8
[ 601.191158] ffff880036032e00 ffff88007c60ff58 ffff880078e15780 ffff88007c60fe08
[ 601.193180] ffffffff8162a570 ffff880078e15780 ffff88007c60fe68 ffffffff8162d698
[ 601.195217] Call Trace:
[ 601.196226] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.197683] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.199407] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.201192] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.202711] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.204328] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.205874] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.207376] a.out D ffff88007c24fdf0 0 3421 3417 0x00000084
[ 601.209286] ffff88007c24fdf0 ffff880078e13200 ffff88007c250000 ffff880036032de8
[ 601.211316] ffff880036032e00 ffff88007c24ff58 ffff880078e13200 ffff88007c24fe08
[ 601.213335] ffffffff8162a570 ffff880078e13200 ffff88007c24fe68 ffffffff8162d698
[ 601.215356] Call Trace:
[ 601.216377] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.217831] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.219529] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.221296] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.222802] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.224403] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.225958] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.227453] a.out D ffff88007823bdf0 0 3422 3417 0x00000084
[ 601.229348] ffff88007823bdf0 ffff880078e10000 ffff88007823c000 ffff880036032de8
[ 601.231395] ffff880036032e00 ffff88007823bf58 ffff880078e10000 ffff88007823be08
[ 601.233427] ffffffff8162a570 ffff880078e10000 ffff88007823be68 ffffffff8162d698
[ 601.235472] Call Trace:
[ 601.236504] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.237989] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.239720] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.241583] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.243144] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.244777] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.246307] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.247823] a.out D ffff88007c483df0 0 3423 3417 0x00000084
[ 601.249719] ffff88007c483df0 ffff880078e13e80 ffff88007c484000 ffff880036032de8
[ 601.251765] ffff880036032e00 ffff88007c483f58 ffff880078e13e80 ffff88007c483e08
[ 601.253808] ffffffff8162a570 ffff880078e13e80 ffff88007c483e68 ffffffff8162d698
[ 601.255831] Call Trace:
[ 601.256850] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.258286] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.260005] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.261803] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.263329] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.264936] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.266504] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.268019] a.out D ffff880035893e08 0 3424 3417 0x00000084
[ 601.269940] ffff880035893e08 ffff880078e17080 ffff880035894000 ffff880036032e00
[ 601.271945] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff880035893e20
[ 601.273954] ffffffff8162a570 ffff880078e17080 ffff880035893ea8 ffffffff8162d8eb
[ 601.276000] Call Trace:
[ 601.277007] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.278497] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[ 601.280240] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.282058] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[ 601.283872] [<ffffffff8162d05f>] ?
down_write+0x1f/0x30 [ 601.285403] [<ffffffff81147abe>] vm_munmap+0x2e/0x60 [ 601.286924] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30 [ 601.288435] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 601.290184] a.out D ffff8800353b7df0 0 3425 3417 0x00000084 [ 601.292108] ffff8800353b7df0 ffff880078e10c80 ffff8800353b8000 ffff880036032de8 [ 601.294165] ffff880036032e00 ffff8800353b7f58 ffff880078e10c80 ffff8800353b7e08 [ 601.296206] ffffffff8162a570 ffff880078e10c80 ffff8800353b7e68 ffffffff8162d698 [ 601.298267] Call Trace: [ 601.299300] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.300755] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.302437] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.304221] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.305764] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.307389] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.308968] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.310488] a.out D ffff88007cf87df0 0 3426 3417 0x00000084 [ 601.312380] ffff88007cf87df0 ffff880078e16400 ffff88007cf88000 ffff880036032de8 [ 601.314414] ffff880036032e00 ffff88007cf87f58 ffff880078e16400 ffff88007cf87e08 [ 601.316443] ffffffff8162a570 ffff880078e16400 ffff88007cf87e68 ffffffff8162d698 [ 601.318490] Call Trace: [ 601.319536] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.321036] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.322763] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.324504] [<ffffffff8162d032>] ? 
down_read+0x12/0x20 [ 601.326071] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.327715] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.329287] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.330761] a.out D ffff8800792dfdf0 0 3427 3417 0x00000084 [ 601.332705] ffff8800792dfdf0 ffff880078e12580 ffff8800792e0000 ffff880036032de8 [ 601.334699] ffff880036032e00 ffff8800792dff58 ffff880078e12580 ffff8800792dfe08 [ 601.336750] ffffffff8162a570 ffff880078e12580 ffff8800792dfe68 ffffffff8162d698 [ 601.338794] Call Trace: [ 601.339781] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.341280] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.343009] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.344813] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.346361] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.347990] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.349521] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.351044] a.out D ffff88007743faa8 0 3428 3417 0x00000084 [ 601.352942] ffff88007743faa8 ffff88007bda6400 ffff880077440000 ffff88007743fae0 [ 601.354990] ffff88007fccdfc0 00000001000484e5 0000000000000000 ffff88007743fac0 [ 601.357024] ffffffff8162a570 ffff88007fccdfc0 ffff88007743fb40 ffffffff8162dbed [ 601.359075] Call Trace: [ 601.360096] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.361540] [<ffffffff8162dbed>] schedule_timeout+0x11d/0x1c0 [ 601.363190] [<ffffffff810c7e00>] ? 
cascade+0x90/0x90 [ 601.364697] [<ffffffff8162dce9>] schedule_timeout_uninterruptible+0x19/0x20 [ 601.366574] [<ffffffff8111fc9d>] __alloc_pages_nodemask+0x84d/0xc30 [ 601.368332] [<ffffffff811609a7>] alloc_pages_current+0x87/0x110 [ 601.370002] [<ffffffff811166cf>] __page_cache_alloc+0xaf/0xc0 [ 601.371606] [<ffffffff81119225>] filemap_fault+0x1e5/0x420 [ 601.373203] [<ffffffff81244f39>] xfs_filemap_fault+0x39/0x60 [ 601.374798] [<ffffffff8113d5e7>] __do_fault+0x47/0xd0 [ 601.376315] [<ffffffff81142ec5>] handle_mm_fault+0xe15/0x1370 [ 601.377938] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30 [ 601.379707] [<ffffffff81058217>] __do_page_fault+0x177/0x400 [ 601.381320] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.382831] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.384337] a.out R running task 0 3419 3417 0x00000080 [ 601.386257] 00000000f80745e8 ffff880034ab4400 ffff8800776d3f18 ffff8800776d3f18 [ 601.388287] 0000000000000080 0000000000000000 ffff8800776d3ec8 ffffffff81187e72 [ 601.390341] ffff880034ab4400 ffff880034ab4410 0000000000020000 0000000000000000 [ 601.392366] Call Trace: [ 601.393388] [<ffffffff81187e72>] ? __vfs_read+0x32/0xf0 [ 601.394952] [<ffffffff81290aa9>] ? security_file_permission+0xa9/0xc0 [ 601.396745] [<ffffffff8118858d>] ? rw_verify_area+0x4d/0xd0 [ 601.398359] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120 [ 601.399897] [<ffffffff81189560>] ? SyS_pread64+0x90/0xb0 [ 601.401429] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71 ---------- I think that I noticed three problems from this reproducer. (1) While the likeliness of hitting mmap_sem livelock would depend on how frequently down_read(&mm->mmap_sem) tasks and down_write(&mm->mmap_sem) tasks contend on the OOM victim's mm, we can hit mmap_sem livelock with even only one down_read(&mm->mmap_sem) task. On systems where processes are monitored using /proc/pid/ interface, we can by chance hit this mmap_sem livelock. 
(2) The OOM killer tries to kill a child process of the memory hog, but the
    child process is not always consuming much memory. The memory-unzapping
    kernel thread might not be able to reclaim enough memory unless we
    choose subsequent OOM victims when the first OOM victim task gets stuck
    in the mmap_sem livelock.

(3) I don't know the reason, but I can observe that (when many tasks have
    received SIGKILL from the OOM killer) many of the dying tasks keep
    competing for memory via page_fault(), and they cannot make forward
    progress because dying tasks without TIF_MEMDIE are not allowed to
    access the memory reserves.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Silent hang up caused by pages being not scanned?
  2015-10-12  6:43       ` Tetsuo Handa
@ 2015-10-12 15:25         ` Tetsuo Handa
  2015-10-12 21:23           ` Linus Torvalds
  2015-10-13 13:32           ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-12 15:25 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
    linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Uptime between 101 and 300 is a silent hang up (i.e. no OOM killer messages,
> no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I solved using SysRq-f
> at uptime = 289. I don't know the reason of this silent hang up, but the
> memory unzapping kernel thread will not help because there is no OOM victim.
>
> ----------
> [  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  289.343187] sysrq: SysRq : Manual OOM execution
> (...snipped...)
> [  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
> (...snipped...)
> [  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
> ----------

I examined this hang up using an additional debug printk() patch, and
observed that when this silent hang up occurs, zone_reclaimable() (called
from shrink_zones(), called from a __GFP_FS memory allocation request) is
returning true forever. Since the __GFP_FS memory allocation request can
never call out_of_memory() due to did_some_progress > 0, the system will
silently hang up with 100% CPU usage.
----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0473eec..fda0bb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,6 +2821,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 #endif /* CONFIG_COMPACTION */
 
+pid_t dump_target_pid;
+
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -2847,6 +2849,9 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "__perform_reclaim returned %u at line %u\n",
+		       progress, __LINE__);
 	return progress;
 }
 
@@ -3007,6 +3012,7 @@ static int malloc_watchdog(void *unused)
 	unsigned int memdie_pending;
 	unsigned int stalling_tasks;
 	u8 index;
+	pid_t pid;
 
  not_stalling: /* Healty case. */
 	/*
@@ -3025,12 +3031,16 @@ static int malloc_watchdog(void *unused)
 	 * and stop_memalloc_timer() within timeout duration.
 	 */
 	if (likely(!memalloc_counter[index]))
+	{
+		dump_target_pid = 0;
 		goto not_stalling;
+	}
 
  maybe_stalling: /* Maybe something is wrong. Let's check. */
 	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
 	sigkill_pending = 0;
 	memdie_pending = 0;
 	stalling_tasks = 0;
+	pid = 0;
 	preempt_disable();
 	rcu_read_lock();
 	for_each_process_thread(g, p) {
@@ -3062,8 +3072,11 @@ static int malloc_watchdog(void *unused)
 			       (fatal_signal_pending(p) ? "-dying" : ""),
 			       p->comm, p->pid, m->gfp, m->order, spent);
 			show_stack(p, NULL);
+			if (!pid && (m->gfp & __GFP_FS))
+				pid = p->pid;
 		}
 	spin_unlock(&memalloc_list_lock);
+	dump_target_pid = -pid;
 	/* Wait until next timeout duration. */
 	schedule_timeout_interruptible(timeout);
 	if (memalloc_counter[index])
@@ -3155,6 +3168,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 retry:
+	if (dump_target_pid == -current->pid)
+		dump_target_pid = -dump_target_pid;
+
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
@@ -3280,6 +3296,11 @@ retry:
 		goto noretry;
 
 	/* Keep reclaiming pages as long as there is reasonable progress */
+	if (dump_target_pid == current->pid) {
+		printk(KERN_INFO "did_some_progress=%lu at line %u\n",
+		       did_some_progress, __LINE__);
+		dump_target_pid = 0;
+	}
 	pages_reclaimed += did_some_progress;
 	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27d580b..cb0c22e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2527,6 +2527,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	return watermark_ok;
 }
 
+extern pid_t dump_target_pid;
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2619,16 +2621,41 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
 			if (nr_soft_reclaimed)
+			{
+				if (dump_target_pid == current->pid)
+					printk(KERN_INFO "nr_soft_reclaimed=%lu at line %u\n",
+					       nr_soft_reclaimed, __LINE__);
 				reclaimable = true;
+			}
 			/* need some check for avoid more shrink_zone() */
 		}
 
 		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
+		{
+			if (dump_target_pid == current->pid)
+				printk(KERN_INFO "shrink_zone returned 1 at line %u\n",
+				       __LINE__);
 			reclaimable = true;
+		}
 
 		if (global_reclaim(sc) &&
 		    !reclaimable && zone_reclaimable(zone))
+		{
+			if (dump_target_pid == current->pid) {
+				printk(KERN_INFO "zone_reclaimable returned 1 at line %u\n",
+				       __LINE__);
+				printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+				       zone_page_state(zone, NR_ACTIVE_FILE),
+				       zone_page_state(zone, NR_INACTIVE_FILE));
+				if (get_nr_swap_pages() > 0)
+					printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+					       zone_page_state(zone, NR_ACTIVE_ANON),
+					       zone_page_state(zone, NR_INACTIVE_ANON));
+				printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+				       zone_page_state(zone, NR_PAGES_SCANNED));
+			}
 			reclaimable = true;
+		}
 	}
 
 	/*
@@ -2674,6 +2701,9 @@ retry:
 						sc->priority);
 		sc->nr_scanned = 0;
 		zones_reclaimable = shrink_zones(zonelist, sc);
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "shrink_zones returned %u at line %u\n",
+			       zones_reclaimable, __LINE__);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2707,11 +2737,21 @@ retry:
 	delayacct_freepages_end();
 
 	if (sc->nr_reclaimed)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->nr_reclaimed=%lu at line %u\n",
+			       sc->nr_reclaimed, __LINE__);
 		return sc->nr_reclaimed;
+	}
 
 	/* Aborted reclaim to try compaction? don't OOM, then */
 	if (sc->compaction_ready)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->compaction_ready=%u at line %u\n",
+			       sc->compaction_ready, __LINE__);
 		return 1;
+	}
 
 	/* Untapped cgroup reserves? Don't OOM, retry. */
 	if (!sc->may_thrash) {
@@ -2720,6 +2760,9 @@ retry:
 		goto retry;
 	}
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "zones_reclaimable=%u at line %u\n",
+		       zones_reclaimable, __LINE__);
 	/* Any of the zones still reclaimable? Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
@@ -2875,7 +2918,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * point.
 	 */
 	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "throttle_direct_reclaim returned 1 at line %u\n",
+			       __LINE__);
 		return 1;
+	}
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
@@ -2885,6 +2933,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "do_try_to_free_pages returned %lu at line %u\n",
+		       nr_reclaimed, __LINE__);
 	return nr_reclaimed;
 }
----------

What is strange is that the values printed by this debug printk() patch did
not change as time went by. Thus, I think that this is not a problem of
lack of CPU time for scanning pages. I suspect that there is a bug where
nobody is scanning pages.

----------
[   66.821450] zone_reclaimable returned 1 at line 2646
[   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[   66.824935] shrink_zones returned 1 at line 2706
[   66.826392] zones_reclaimable=1 at line 2765
[   66.827865] do_try_to_free_pages returned 1 at line 2938
[   67.102322] __perform_reclaim returned 1 at line 2854
[   67.103968] did_some_progress=1 at line 3301
(...snipped...)
[  281.439977] zone_reclaimable returned 1 at line 2646
[  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[  281.439978] shrink_zones returned 1 at line 2706
[  281.439978] zones_reclaimable=1 at line 2765
[  281.439979] do_try_to_free_pages returned 1 at line 2938
[  281.439979] __perform_reclaim returned 1 at line 2854
[  281.439980] did_some_progress=1 at line 3301
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151013.txt.xz
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
@ 2015-10-12 21:23           ` Linus Torvalds
  2015-10-13 12:21             ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-10-12 21:23 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
    Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
    linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I examined this hang up using additional debug printk() patch. And it was
> observed that when this silent hang up occurs, zone_reclaimable() called from
> shrink_zones() called from a __GFP_FS memory allocation request is returning
> true forever. Since the __GFP_FS memory allocation request can never call
> out_of_memory() due to did_some_progress > 0, the system will silently hang up
> with 100% CPU usage.

I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.

So the do_try_to_free_pages() logic that does that

	/* Any of the zones still reclaimable? Don't OOM. */
	if (zones_reclaimable)
		return 1;

is rather dubious. The history of that odd line is pretty dubious too:
it used to be that we would return success if "shrink_zones()"
succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
logic got rewritten, and I don't think the current situation is all
that sane.

And returning 1 there is actively misleading to callers, since it
makes them think that it made progress.

So I think you should look at what happens if you just remove that
illogical and misleading return value.

HOWEVER.

I think that it's very true that we have then tuned all our *other*
heuristics for taking this thing into account, so I suspect that we'll
find that we'll need to tweak other places.
But this crazy "let's say that we made progress even when we didn't"
thing looks just wrong.

In particular, I think that you'll find that you will have to change
the heuristics in __alloc_pages_slowpath() where we currently do

	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..

when the "did_some_progress" logic changes that radically.

Because while the current return value looks insane, all the other
testing and tweaking has been done with that very odd return value in
place.

		Linus
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 21:23           ` Linus Torvalds
@ 2015-10-13 12:21             ` Tetsuo Handa
  2015-10-13 16:37               ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-13 12:21 UTC (permalink / raw)
To: torvalds
Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
    linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I examined this hang up using additional debug printk() patch. And it was
> > observed that when this silent hang up occurs, zone_reclaimable() called from
> > shrink_zones() called from a __GFP_FS memory allocation request is returning
> > true forever. Since the __GFP_FS memory allocation request can never call
> > out_of_memory() due to did_some_progress > 0, the system will silently hang up
> > with 100% CPU usage.
>
> I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.
>

I compared "hang up after the OOM killer is invoked" and "hang up before
the OOM killer is invoked" by always printing the values.

 			}
 			reclaimable = true;
 		}
+		else if (dump_target_pid == current->pid) {
+			printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+			       zone_page_state(zone, NR_ACTIVE_FILE),
+			       zone_page_state(zone, NR_INACTIVE_FILE));
+			if (get_nr_swap_pages() > 0)
+				printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+				       zone_page_state(zone, NR_ACTIVE_ANON),
+				       zone_page_state(zone, NR_INACTIVE_ANON));
+			printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+			       zone_page_state(zone, NR_PAGES_SCANNED));
+		}
 	}
 
 	/*

For the former case, most of the trials showed
(ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 . Sometimes
PAGES_SCANNED > 0 (as grep'ed below), but ACTIVE_FILE and INACTIVE_FILE
seem to be always 0.
----------
[  195.905057] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  195.927430] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.317088] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.338007] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.723776] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.744618] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.129653] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.151238] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.650232] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.671343] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  277.980310] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  278.001481] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.339220] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.361908] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.682988] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.704055] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  350.368952] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  350.389770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.724821] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.746100] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.231887] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.233770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.253196] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.254910] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[ 1397.628073] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1397.649165] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.207041] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.228762] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
----------

For the latter case, most of the output showed that
ACTIVE_FILE + INACTIVE_FILE > 0.

----------
[  142.647201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.648883] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  142.842868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.955817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.086363] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.231120] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.359238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.473342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.618103] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.746210] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.908162] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.035415] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.161926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.306435] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.434265] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.436099] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  144.643374] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.773239] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.902309] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.046154] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.185410] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.317218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.460304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.654212] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.817362] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.945136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.086303] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.242127] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.489868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.491593] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  153.674246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.839478] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.003234] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.155085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.322187] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.447355] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.653150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.782216] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.939439] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.105921] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.278386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.440832] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.623970] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.625766] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.831074] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.996903] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.139137] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.318492] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.484300] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.667411] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.817246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.012323] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.159483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.323193] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.488399] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.654198] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.339172] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.340896] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.583026] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.797386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.965110] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.124935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.431304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.700317] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.862071] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.029257] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.198312] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.356224] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.559302] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.684486] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.898551] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.900496] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.175960] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.324390] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.526150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.693365] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.878407] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.061503] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.225306] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.416398] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.617395] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.783201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.989053] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  169.196126] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.361136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.362865] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.626817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.797361] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.006389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.211479] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.433890] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.630951] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.855509] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.049814] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.258218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.455404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.665085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.874173] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.057217] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.059056] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.350935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.559404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.782483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.982803] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.203930] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.428321] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.611349] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.851164] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.034220] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.279197] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.455284] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.811445] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.368405] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.370115] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.614733] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.845695] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.024274] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.211389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.427147] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.552333] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.734117] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.935811] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.138296] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.354041] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.559245] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.641776] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.716434] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.718199] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.015952] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.218976] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.440131] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.659238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.882360] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.087342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.314442] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.408926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.631240] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.850326] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.067488] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.283243] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
----------

So, something is preventing ACTIVE_FILE and INACTIVE_FILE from becoming 0?

I also tried the change below, but the result was the same. Therefore, this
problem seems to be independent of "!__GFP_FS allocations do not fail".
(A complete log with the change below applied (uptime > 101) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151013-2.txt.xz . )
----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2736,7 +2736,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		 * and the OOM killer can't be invoked, but
 		 * keep looping as per tradition.
 		 */
-		*did_some_progress = 1;
 		goto out;
 	}
 	if (pm_suspended_storage())
----------
----------
[ 102.719555] (ACTIVE_FILE=3+INACTIVE_FILE=3) * 6 > PAGES_SCANNED=19
[ 102.721234] (ACTIVE_FILE=1+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[ 102.722908] shrink_zones returned 1 at line 2717
----------

> So the do_try_to_free_pages() logic that does that
>
>         /* Any of the zones still reclaimable? Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> is rather dubious. The history of that odd line is pretty dubious too:
> it used to be that we would return success if "shrink_zones()"
> succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
> logic got rewritten, and I don't think the current situation is all
> that sane.
>
> And returning 1 there is actively misleading to callers, since it
> makes them think that it made progress.
>
> So I think you should look at what happens if you just remove that
> illogical and misleading return value.
>

If I remove

	/* Any of the zones still reclaimable? Don't OOM. */
	if (zones_reclaimable)
		return 1;

the OOM killer is invoked even when there is so much memory which can be
reclaimed after being written to disk. This is definitely a premature
invocation of the OOM killer.
$ cat < /dev/zero > /tmp/log & sleep 10; ./a.out
---------- When there is a lot of data to write ----------
[ 489.952827] Mem-Info:
[ 489.953840] active_anon:328227 inactive_anon:3033 isolated_anon:26
[ 489.953840]  active_file:2309 inactive_file:80915 isolated_file:0
[ 489.953840]  unevictable:0 dirty:53 writeback:80874 unstable:0
[ 489.953840]  slab_reclaimable:4975 slab_unreclaimable:4256
[ 489.953840]  mapped:2973 shmem:4192 pagetables:1939 bounce:0
[ 489.953840]  free:12963 free_pcp:60 free_cma:0
[ 489.963395] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5728kB inactive_anon:88kB active_file:140kB inactive_file:1276kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:1300kB mapped:140kB shmem:160kB slab_reclaimable:256kB slab_unreclaimable:180kB kernel_stack:64kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9768 all_unreclaimable? yes
[ 489.974035] lowmem_reserve[]: 0 1729 1729 1729
[ 489.975813] Node 0 DMA32 free:44552kB min:44652kB low:55812kB high:66976kB active_anon:1307180kB inactive_anon:12044kB active_file:9096kB inactive_file:322384kB unevictable:0kB isolated(anon):104kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:216kB writeback:322196kB mapped:11752kB shmem:16608kB slab_reclaimable:19644kB slab_unreclaimable:16844kB kernel_stack:3584kB pagetables:7576kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:2419896 all_unreclaimable? yes
[ 489.988452] lowmem_reserve[]: 0 0 0 0
[ 489.990043] Node 0 DMA: 2*4kB (UE) 1*8kB (M) 4*16kB (UME) 1*32kB (E) 2*64kB (UE) 3*128kB (UME) 2*256kB (UM) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7280kB
[ 489.995142] Node 0 DMA32: 578*4kB (UME) 726*8kB (UE) 447*16kB (UE) 253*32kB (UME) 155*64kB (UME) 42*128kB (UME) 3*256kB (UME) 2*512kB (UM) 4*1024kB (U) 0*2048kB 0*4096kB = 44552kB
[ 490.000511] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 490.002914] 87434 total pagecache pages
[ 490.004612] 0 pages in swap cache
[ 490.006138] Swap cache stats: add 0, delete 0, find 0/0
[ 490.007976] Free swap  = 0kB
[ 490.009329] Total swap = 0kB
[ 490.011033] 524157 pages RAM
[ 490.012352] 0 pages HighMem/MovableOnly
[ 490.013903] 76615 pages reserved
[ 490.015260] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------

$ ./a.out
---------- When there is no data to write ----------
[ 792.359024] Mem-Info:
[ 792.360001] active_anon:413751 inactive_anon:6226 isolated_anon:0
[ 792.360001]  active_file:0 inactive_file:0 isolated_file:0
[ 792.360001]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 792.360001]  slab_reclaimable:1243 slab_unreclaimable:3638
[ 792.360001]  mapped:104 shmem:6236 pagetables:1033 bounce:0
[ 792.360001]  free:12965 free_pcp:126 free_cma:0
[ 792.368559] Node 0 DMA free:7292kB min:400kB low:500kB high:600kB active_anon:7040kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:160kB slab_reclaimable:24kB slab_unreclaimable:172kB kernel_stack:64kB pagetables:460kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 792.378240] lowmem_reserve[]: 0 1729 1729 1729
[ 792.379834] Node 0 DMA32 free:44568kB min:44652kB low:55812kB high:66976kB active_anon:1647964kB inactive_anon:24744kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:0kB writeback:0kB mapped:416kB shmem:24784kB slab_reclaimable:4948kB slab_unreclaimable:14380kB kernel_stack:3104kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:504kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 792.390085] lowmem_reserve[]: 0 0 0 0
[ 792.391643] Node 0 DMA: 3*4kB (UE) 0*8kB 3*16kB (UE) 24*32kB (ME) 11*64kB (UME) 5*128kB (UM) 2*256kB (ME) 3*512kB (ME) 1*1024kB (E) 1*2048kB (E) 0*4096kB = 7292kB
[ 792.396201] Node 0 DMA32: 242*4kB (UME) 386*8kB (UME) 397*16kB (UME) 199*32kB (UE) 105*64kB (UME) 37*128kB (UME) 24*256kB (UME) 20*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 44616kB
[ 792.401136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 792.403356] 6250 total pagecache pages
[ 792.404803] 0 pages in swap cache
[ 792.406208] Swap cache stats: add 0, delete 0, find 0/0
[ 792.407896] Free swap  = 0kB
[ 792.409172] Total swap = 0kB
[ 792.410460] 524157 pages RAM
[ 792.411752] 0 pages HighMem/MovableOnly
[ 792.413106] 76615 pages reserved
[ 792.414493] 0 pages hwpoisoned
---------- When there is no data to write ----------

> HOWEVER.
>
> I think that it's very true that we have then tuned all our *other*
> heuristics for taking this thing into account, so I suspect that we'll
> find that we'll need to tweak other places. But this crazy "let's say
> that we made progress even when we didn't" thing looks just wrong.
>
> In particular, I think that you'll find that you will have to change
> the heuristics in __alloc_pages_slowpath() where we currently do
>
>    if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
>
> when the "did_some_progress" logic changes that radically.
>

Yes. But we can't simply do

	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..

because we won't be able to call out_of_memory(), can we?

> Because while the current return value looks insane, all the other
> testing and tweaking has been done with that very odd return value in
> place.
>
>               Linus
>

Well, did I encounter a difficult-to-fix problem?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 12:21 ` Tetsuo Handa
@ 2015-10-13 16:37 ` Linus Torvalds
  2015-10-14 12:21   ` Tetsuo Handa
  2015-10-15 13:14   ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-10-13 16:37 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> If I remove
>
>         /* Any of the zones still reclaimable? Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> the OOM killer is invoked even when there is so much memory which can be
> reclaimed after being written to disk. This is definitely a premature
> invocation of the OOM killer.

Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".

> Yes. But we can't simply do
>
>         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?

So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.

What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.

And looping is stupid, because we may not be able to actually free
anything exactly because of limitations like __GFP_FS.

So

 (a) the looping condition is senseless

 (b) what we do when looping is senseless

and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.

So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.
But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heuristics to this
whole loop".

So instead of that senseless thing, how about trying something
*sensible*. Make the code do something that we can actually explain as
making sense.

I'd suggest something like:

 - add a "retry count"

 - if direct reclaim made no progress, or made less progress than the target:

       if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

 - regardless of whether we made progress or not:

       if (retry count < X) goto retry;

       if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
       goto retry

where "X" is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).

The whole time-based thing might even be explicit. There's nothing
wrong with doing something like

        unsigned long timeout = jiffies + HZ/4;

at the top of the function, and making the whole retry logic actually
say something like

        if (time_after(timeout, jiffies)) goto noretry;

(or make *that* trigger the oom logic, or whatever).

Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?

             Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37 ` Linus Torvalds
@ 2015-10-14 12:21 ` Tetsuo Handa
  2015-10-15 13:14   ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 12:21 UTC (permalink / raw)
To: torvalds
Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > If I remove
> >
> >         /* Any of the zones still reclaimable? Don't OOM. */
> >         if (zones_reclaimable)
> >                 return 1;
> >
> > the OOM killer is invoked even when there are so much memory which can be
> > reclaimed after written to disk. This is definitely premature invocation of
> > the OOM killer.
>
> Right. The rest of the code knows that the return value right now
> means "there is no memory at all" rather than "I made progress".
>
> > Yes. But we can't simply do
> >
> >         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
> >
> > because we won't be able to call out_of_memory(), can we?
>
> So I think that whole thing is kind of senseless. Not just that
> particular conditional, but what it *does* too.
>
> What can easily happen is that we are a blocking allocation, but
> because we're __GFP_FS or something, the code doesn't actually start
> writing anything out. Nor is anything congested. So the thing just
> loops.

congestion_wait() sounds like a source of silent hang up.
http://lkml.kernel.org/r/201406052145.CIB35534.OQLVMSJFOHtFOF@I-love.SAKURA.ne.jp

>
> And looping is stupid, because we may be not able to actually free
> anything exactly because of limitations like __GFP_FS.
>
> So
>
>  (a) the looping condition is senseless
>
>  (b) what we do when looping is senseless
>
> and we actually do try to wake up kswapd in the loop, but we never
> *wait* for it, so that's largely pointless too.

Aren't we waiting for kswapd forever?
In other words, we never check whether kswapd can make some progress.
http://lkml.kernel.org/r/20150812091104.GA14940@dhcp22.suse.cz

>
> So *of*course* the direct reclaim code has to set "I made progress",
> because if it doesn't lie and say so, then the code will randomly not
> loop, and will oom, and things go to hell.
>
> But I hate the "let's tweak the zone_reclaimable" idea, because it
> doesn't actually fix anything. It just perpetuates this "the code
> doesn't make sense, so let's add *more* senseless heuristics to this
> whole loop".

I also don't think that tweaking the current reclaim logic solves the bugs
which bothered me via unexplained hangups / reboots. To me, the current
memory allocator is so puzzling that it is as if

	if (there_is_much_free_memory() == TRUE) goto OK;
	if (do_some_heuristic1() == SUCCESS) goto OK;
	if (do_some_heuristic2() == SUCCESS) goto OK;
	if (do_some_heuristic3() == SUCCESS) goto OK;
	(...snipped...)
	if (do_some_heuristicN() == SUCCESS) goto OK;
	while (1);

and we don't know how many heuristics we need to add in order to avoid
reaching the "while (1);". (We are reaching the "while (1);" before

	if (out_of_memory() == SUCCESS) goto OK;

is called.)

>
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.
>
> I'd suggest something like:
>
>  - add a "retry count"
>
>  - if direct reclaim made no progress, or made less progress than the target:
>
>        if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

Yes.

>
>  - regardless of whether we made progress or not:
>
>        if (retry count < X) goto retry;
>
>        if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
>        goto retry

I tried sleeping to reduce CPU usage and reporting via SysRq-w.
http://lkml.kernel.org/r/201411231353.BDE90173.FQOMJtHOLVFOFS@I-love.SAKURA.ne.jp

I complained at
http://lkml.kernel.org/r/201502162023.GGE26089.tJOOFQMFFHLOVS@I-love.SAKURA.ne.jp

| Oh, why every thread trying to allocate memory has to repeat
| the loop that might defer somebody who can make progress if CPU time was
| given? I wish only somebody like kswapd repeats the loop on behalf of all
| threads waiting at memory allocation slowpath...

Direct reclaim can defer termination upon SIGKILL if it is blocked on an
unkillable lock. If performance were not a problem, would direct reclaim
be mandatory? Of course, performance is the problem. Thus we would try
direct reclaim at least once. But I wish the memory allocation logic were
as simple as

  (1) If there is enough free memory, allocate it.

  (2) If there is not enough free memory, join the waitqueue list with
      wait_event_timeout(waiter, memory_reclaimed, timeout) and wait for
      reclaiming kernel threads (e.g. kswapd) to wake the waiters up.
      If the caller is willing to give up upon SIGKILL (e.g.
      __GFP_KILLABLE) then use
      wait_event_killable_timeout(waiter, memory_reclaimed, timeout)
      and return NULL upon SIGKILL.

  (3) Whenever the reclaiming kernel threads reclaimed memory and there
      are waiters, wake the waiters up.

  (4) If the reclaiming kernel threads cannot reclaim memory, the caller
      will wake up due to timeout, and invoke the OOM killer unless the
      caller does not want it (e.g. __GFP_NO_OOMKILL).

> where "X" is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).

Isn't a second too short for waiting for swapping / writeback?

> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
>
>         unsigned long timeout = jiffies + HZ/4;
>
> at the top of the function, and making the whole retry logic actually
> say something like
>
>         if (time_after(timeout, jiffies)) goto noretry;
>
> (or make *that* trigger the oom logic, or whatever).

I prefer the time-based thing, because my customer's usage (where the
watchdog timeout is configured to 10 seconds) will require kernel
messages (maybe OOM killer messages) to be printed within a few seconds.

> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?
>
>              Linus

Yes, these will be big changes. But this change will be better than
living with "no means for understanding what was happening are available"
vs. "really interesting things are observed if means are available".
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37 ` Linus Torvalds
  2015-10-14 12:21   ` Tetsuo Handa
@ 2015-10-15 13:14   ` Michal Hocko
  2015-10-16 15:57     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-15 13:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

[CC Mel and Rik as well - this has diverged from the original thread
considerably but the current topic started here:
http://lkml.kernel.org/r/201510130025.EJF21331.FFOQJtVOMLFHSO%40I-love.SAKURA.ne.jp ]

On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.

I do agree that zone_reclaimable is a subtle and hackish way to wait for
the writeback/kswapd to clean up pages which cannot be reclaimed from
the direct reclaim.

> I'd suggest something like:
>
>  - add a "retry count"
>
>  - if direct reclaim made no progress, or made less progress than the target:
>
>        if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;
>
>  - regardless of whether we made progress or not:
>
>        if (retry count < X) goto retry;
>
>        if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
>        goto retry

This will certainly cap the reclaim retries but there are risks with
this approach afaics. First of all, other allocators might piggy-back on
the current reclaimer and push it to the OOM killer even when we are not
really OOM. Maybe this is possible currently as well but it is less
likely because NR_PAGES_SCANNED is reset on a freed page, which allows
the reclaimer another round.

I am also not sure it would help with pathological cases like the one
discussed here.
If you have only a small amount of reclaimable memory on the LRU lists
then you scan it quite quickly, which will consume retries. Maybe a
sufficient timeout can help but I am afraid we can still hit the OOM
killer prematurely because a large part of the memory is still under
writeback (which might be a slow device - e.g. a USB stick).

We used to have this kind of problem in memcg reclaim. We do not have
(resp. didn't have until recently with CONFIG_CGROUP_WRITEBACK) dirty
memory throttling for memory cgroups, so the LRU can become full of
dirty data really quickly, and that led to memcg OOM killing. We are not
doing zone_reclaimable and other heuristics there, so we had to
explicitly wait_on_page_writeback in the reclaim to prevent premature
OOM killer invocations. An ugly hack, but the only thing that worked
reliably. Time-based solutions were tried and failed with different
workloads, and quite randomly depending on the load/storage.

> where "X" is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).
>
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
>
>         unsigned long timeout = jiffies + HZ/4;
>
> at the top of the function, and making the whole retry logic actually
> say something like
>
>         if (time_after(timeout, jiffies)) goto noretry;
>
> (or make *that* trigger the oom logic, or whatever).
>
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?

Yes, zone_reclaimable is subtle and imho it is used even at the wrong
level. We should decide whether we are really OOM at
__alloc_pages_slowpath.
We definitely need a big-picture logic to tell us when it makes sense to
drop the ball and trigger the OOM killer or fail the allocation request.

E.g. free + reclaimable + writeback < min_wmark on all usable zones for
more than X rounds of direct reclaim without any progress is a
sufficient signal to go OOM. Costly/noretry allocations can fail earlier
of course. This is obviously a half-baked idea which needs much more
consideration; all I am trying to say is that we need a high-level
metric to tell the OOM condition.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-15 13:14 ` Michal Hocko
@ 2015-10-16 15:57   ` Michal Hocko
  2015-10-16 18:34     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-16 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Thu 15-10-15 15:14:09, Michal Hocko wrote:
> On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
[...]
> > Now, I realize the above suggestions are big changes, and they'll
> > likely break things and we'll still need to tweak things, but dammit,
> > wouldn't that be better than just randomly tweaking the insane
> > zone_reclaimable logic?
>
> Yes zone_reclaimable is subtle and imho it is used even at the
> wrong level. We should decide whether we are really OOM at
> __alloc_pages_slowpath. We definitely need a big picture logic to tell
> us when it makes sense to drop the ball and trigger OOM killer or fail
> the allocation request.
>
> E.g. free + reclaimable + writeback < min_wmark on all usable zones for
> more than X rounds of direct reclaim without any progress is
> a sufficient signal to go OOM. Costly/noretry allocations can fail earlier
> of course. This is obviously a half baked idea which needs much more
> consideration all I am trying to say is that we need a high level metric
> to tell OOM condition.

OK so here is what I am playing with currently. It is not complete yet.
Anyway I have tested it with 2 scenarios on a swapless system with 2G of
RAM. Both do:

$ cat writer.sh
#!/bin/sh
size=$((1<<30))
block=$((4<<10))
writer()
{
	(
	while true
	do
		dd if=/dev/zero of=/mnt/data/file.$1 bs=$block count=$(($size/$block))
		rm /mnt/data/file.$1
		sync
	done
	) &
}
writer 1
writer 2
sleep 10s # allow to accumulate enough dirty pages

1) massive OOM

Start 100 memeaters, each 80M, run in parallel (anon private MAP_POPULATE
mapping). This will trigger the OOM killer many times, and the overall
count is what I was interested in. The test is considered finished when we
get to a steady state - writers can make progress and there is no more OOM
killing for some time.

$ grep "invoked oom-killer" base-run-oom.log | wc -l
78
$ grep "invoked oom-killer" test-run-oom.log | wc -l
63

So it looks like we have triggered less OOM killing with the patch
applied. I haven't checked those too closely but it seems like at least
two instances might not have triggered with the current implementation
because the DMA32 zone is considered reclaimable. But this check is
inherently racy so we cannot be sure.

$ grep "DMA32.*all_unreclaimable? no" test2-run-oom.log | wc -l
2

2) almost OOM situation

Invoke 10 memeaters in parallel and try to fill up all the memory without
triggering the OOM killer. This is quite hard and it required a lot of
tuning.
I've ended up with:

#!/bin/sh
pkill mem_eater
sync
echo 3 > /proc/sys/vm/drop_caches
sync
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
sh writer.sh &
sleep 10s
for i in $(seq 10)
do
	memcg_test/tools/mem_eater $size &
done
wait

and this one doesn't hit the OOM killer with the original implementation
while it hits it with the patch applied:

[   32.727001] DMA32 free:5428kB min:5532kB low:6912kB high:8296kB active_anon:1802520kB inactive_anon:204kB active_file:6692kB inactive_file:137184kB unevictable:0kB isolated(anon):136kB isolated(file):32kB present:2080640kB managed:1997880kB mlocked:0kB dirty:0kB writeback:137168kB mapped:6408kB shmem:204kB slab_reclaimable:20472kB slab_unreclaimable:13276kB kernel_stack:1456kB pagetables:4756kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:948764 all_unreclaimable? yes

There is a lot of memory in writeback but all_unreclaimable is yes, so who
knows, maybe it is just a coincidence we haven't triggered OOM in the
original kernel.

Anyway the two implementations will be hard to compare because the
workloads are very different, but I think something like below should be
more readable and deterministic than what we have right now. It will need
some more tuning for sure and I will be playing with it some more. I would
just like to hear opinions whether this approach makes sense. If yes I
will post it separately in a new thread for a wider discussion. This email
thread seems to be full of detours already.
---

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 15:57 ` Michal Hocko
@ 2015-10-16 18:34   ` Linus Torvalds
  2015-10-16 18:49     ` Tetsuo Handa
  2015-10-19 12:53     ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-10-16 18:34 UTC (permalink / raw)
To: Michal Hocko
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> OK so here is what I am playing with currently. It is not complete
> yet.

So this looks like it's going in a reasonable direction. However:

> +	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +			ac->high_zoneidx, alloc_flags, target)) {
> +		/* Wait for some write requests to complete then retry */
> +		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +		goto retry;
> +	}

I still think we should at least spend some time re-thinking that
"wait_iff_congested()" thing. We may not actually be congested, but
might be unable to write anything out because of our allocation flags
(ie not allowed to recurse into the filesystems), so we might be in the
situation that we have a lot of dirty pages that we can't directly do
anything about.

Now, we will have woken kswapd, so something *will* hopefully be done
about them eventually, but at no time do we actually really wait for
it. We'll just busy-loop.

So at a minimum, I think we should yield to kswapd. We do do that
"cond_resched()" in wait_iff_congested(), but I'm not entirely
convinced that is at all enough to wait for kswapd to *do* something.

So before we really decide to see if we should oom, I think we should
have at least one forced io_schedule_timeout(), whether we're congested
or not.

And yes, as Tetsuo Handa said, any kind of short wait might be too
short for IO to really complete, but *something* will have completed.
Unless we're so far up the creek that we really should just oom.

But I suspect we'll have to just try things out and tweak it. This
patch looks like a reasonable starting point to me.

Tetsuo, mind trying it out and maybe tweaking it a bit for the load
you have? Does it seem to improve on your situation?

             Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34 ` Linus Torvalds
@ 2015-10-16 18:49   ` Tetsuo Handa
  2015-10-19 12:57     ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-16 18:49 UTC (permalink / raw)
To: torvalds, mhocko
Cc: rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina, mgorman, riel

Linus Torvalds wrote:
> Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> you have? Does it seem to improve on your situation?

Yes, I already tried it and just replied to Michal. I tested for one hour
using various memory stressing programs. As far as I tested, I did not hit
a silent hang up (MemAlloc-Info: X stalling task, 0 dying task, 0 victim
task. where X > 0).
----------------------------------------
[ 134.510993] Mem-Info:
[ 134.511940] active_anon:408777 inactive_anon:2088 isolated_anon:24
[ 134.511940]  active_file:15 inactive_file:24 isolated_file:0
[ 134.511940]  unevictable:0 dirty:4 writeback:1 unstable:0
[ 134.511940]  slab_reclaimable:3109 slab_unreclaimable:5594
[ 134.511940]  mapped:679 shmem:2156 pagetables:2077 bounce:0
[ 134.511940]  free:12911 free_pcp:31 free_cma:0
[ 134.521256] Node 0 DMA free:7256kB min:400kB low:500kB high:600kB active_anon:6560kB inactive_anon:180kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:184kB slab_reclaimable:236kB slab_unreclaimable:296kB kernel_stack:48kB pagetables:556kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 134.532779] lowmem_reserve[]: 0 1714 1714 1714
[ 134.534455] Node 0 DMA32 free:44388kB min:44652kB low:55812kB high:66976kB active_anon:1628548kB inactive_anon:8172kB active_file:60kB inactive_file:96kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:16kB writeback:4kB mapped:2636kB shmem:8440kB slab_reclaimable:12200kB slab_unreclaimable:22080kB kernel_stack:3584kB pagetables:7752kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1016 all_unreclaimable? yes
[ 134.545830] lowmem_reserve[]: 0 0 0 0
[ 134.547404] Node 0 DMA: 16*4kB (UME) 16*8kB (UME) 10*16kB (UME) 6*32kB (UME) 1*64kB (M) 2*128kB (UE) 1*256kB (M) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7264kB
[ 134.552766] Node 0 DMA32: 1158*4kB (UME) 638*8kB (UE) 244*16kB (UME) 163*32kB (UE) 73*64kB (UE) 34*128kB (UME) 17*256kB (UME) 10*512kB (UME) 7*1024kB (UM) 0*2048kB 0*4096kB = 44520kB
[ 134.558111] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 134.560358] 2195 total pagecache pages
[ 134.562043] 0 pages in swap cache
[ 134.563604] Swap cache stats: add 0, delete 0, find 0/0
[ 134.565441] Free swap  = 0kB
[ 134.567015] Total swap = 0kB
[ 134.568628] 524157 pages RAM
[ 134.570034] 0 pages HighMem/MovableOnly
[ 134.571681] 80368 pages reserved
[ 134.573467] 0 pages hwpoisoned
----------------------------------------

The only problem I felt is that the amount of inactive_file/writeback
(shown below) was high (compared to the dump above) when I did

$ cat < /dev/zero > /tmp/file1 & cat < /dev/zero > /tmp/file2 & cat < /dev/zero > /tmp/file3 & sleep 10; ./a.out; killall cat

but I think this patch is better than the current code.
---------------------------------------- [ 1135.909600] Mem-Info: [ 1135.910686] active_anon:321011 inactive_anon:4664 isolated_anon:0 [ 1135.910686] active_file:3170 inactive_file:78035 isolated_file:512 [ 1135.910686] unevictable:0 dirty:0 writeback:78618 unstable:0 [ 1135.910686] slab_reclaimable:5739 slab_unreclaimable:6170 [ 1135.910686] mapped:4666 shmem:8300 pagetables:1966 bounce:0 [ 1135.910686] free:12938 free_pcp:0 free_cma:0 [ 1135.925255] Node 0 DMA free:7232kB min:400kB low:500kB high:600kB active_anon:5852kB inactive_anon:196kB active_file:120kB inactive_file:980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:968kB mapped:248kB shmem:388kB slab_reclaimable:316kB slab_unreclaimable:272kB kernel_stack:64kB pagetables:100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7444 all_unreclaimable? yes [ 1135.936728] lowmem_reserve[]: 0 1714 1714 1714 [ 1135.938486] Node 0 DMA32 free:44520kB min:44652kB low:55812kB high:66976kB active_anon:1278192kB inactive_anon:18460kB active_file:12560kB inactive_file:313176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:0kB writeback:313504kB mapped:18416kB shmem:32812kB slab_reclaimable:22640kB slab_unreclaimable:24408kB kernel_stack:4240kB pagetables:7764kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2957668 all_unreclaimable? 
yes [ 1135.950355] lowmem_reserve[]: 0 0 0 0 [ 1135.952011] Node 0 DMA: 7*4kB (U) 14*8kB (UM) 13*16kB (UM) 6*32kB (UME) 1*64kB (M) 4*128kB (UME) 2*256kB (UM) 3*512kB (UME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7260kB [ 1135.957169] Node 0 DMA32: 241*4kB (UE) 929*8kB (UE) 496*16kB (UME) 277*32kB (UE) 135*64kB (UME) 17*128kB (UME) 3*256kB (E) 16*512kB (ME) 0*1024kB 0*2048kB 0*4096kB = 44972kB [ 1135.963047] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 1135.965472] 90009 total pagecache pages [ 1135.967078] 0 pages in swap cache [ 1135.968581] Swap cache stats: add 0, delete 0, find 0/0 [ 1135.970424] Free swap = 0kB [ 1135.971828] Total swap = 0kB [ 1135.973248] 524157 pages RAM [ 1135.974655] 0 pages HighMem/MovableOnly [ 1135.976230] 80368 pages reserved [ 1135.977745] 0 pages hwpoisoned ---------------------------------------- I can still hit OOM livelock ( MemAlloc-Info: X stalling task, Y dying task, Z victim task. where X > 0 && Y > 0). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:49               ` Tetsuo Handa
@ 2015-10-19 12:57                 ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-10-19 12:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina, mgorman, riel

On Sat 17-10-15 03:49:39, Tetsuo Handa wrote:
> Linus Torvalds wrote:
> > Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> > you have? Does it seem to improve on your situation?
>
> Yes, I already tried it and just replied to Michal.
>
> I tested for one hour using various memory stressing programs.
> As far as I tested, I did not hit silent hang up (

Thank you for your testing!

[...]
> Only problem I felt is that the ratio of inactive_file/writeback
> (shown below) was high (compared to shown above) when I did

Yes this is the lack of congestion on the bdi as Linus expected.
Another patch I've just posted should help in that regards. At least
it seems to help in my testing.

[...]
> I can still hit OOM livelock (
>
>   MemAlloc-Info: X stalling task, Y dying task, Z victim task.
>
> where X > 0 && Y > 0).

This seems a separate issue, though.
--
Michal Hocko
SUSE Labs
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34             ` Linus Torvalds
  2015-10-16 18:49               ` Tetsuo Handa
@ 2015-10-19 12:53               ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-10-19 12:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Fri 16-10-15 11:34:48, Linus Torvalds wrote:
> On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > OK so here is what I am playing with currently. It is not complete
> > yet.
>
> So this looks like it's going in a reasonable direction. However:
>
> > +	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > +			ac->high_zoneidx, alloc_flags, target)) {
> > +		/* Wait for some write requests to complete then retry */
> > +		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +		goto retry;
> > +	}
>
> I still think we should at least spend some time re-thinking that
> "wait_iff_congested()" thing.

You are right. I thought we would be congested most of the time because
of the heavy IO but a quick test has shown that the zone is marked
congested but the nr_wb_congested is zero all the time. That is most
probably because the IO is throttled severly by the lack of memory as
well.

> We may not actually be congested, but
> might be unable to write anything out because of our allocation flags
> (ie not allowed to recurse into the filesystems), so we might be in
> the situation that we have a lot of dirty pages that we can't directly
> do anything about.
>
> Now, we will have woken kswapd, so something *will* hopefully be done
> about them eventually, but at no time do we actually really wait for
> it. We'll just busy-loop.
>
> So at a minimum, I think we should yield to kswapd. We do do that
> "cond_resched()" in wait_iff_congested(), but I'm not entirely
> convinced that is at all enough to wait for kswapd to *do* something.

I went with congestion_wait which is what we used to do in the past
before wait_iff_congested has been introduced. The primary reason for
the change was that congestion_wait used to cause unhealthy stalls in
the direct reclaim where the bdi wasn't really congested and so we were
sleeping for the full timeout.

Now I think we can do better even with congestion_wait. We do not have
to wait when we did_some_progress so we won't affect a regular direct
reclaim path and we can reduce sleeping to:

	dirty+writeback > reclaimable/2

This is a good signal that the reason for no progress is the stale IO
most likely and we need to wait even if the bdi itself is not
congested. We can also increase the timeout to HZ/10 because this is an
extreme slow path - we are not doing any progress and stalling is
better than OOM.

This is a diff on top of the previous patch. I even think that this
part would deserve a separate patch for a better bisect-ability. My
testing shows that close-to-oom behaves better (I can use more memory
for memeaters without hitting OOM).

What do you think?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e28028681c59..fed1bb7ea43a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,8 +3187,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, target)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback = zone_page_state(zone, NR_WRITEBACK),
+				      dirty = zone_page_state(zone, NR_FILE_DIRTY);
+			if (did_some_progress)
+				goto retry;
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent from pre mature OOM
+			 */
+			if (2*(writeback + dirty) > reclaimable)
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else
+				cond_resched();
 			goto retry;
 		}
 	}
--
Michal Hocko
SUSE Labs
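[Editor's illustration] The retry/back-off decision encoded in the diff above can be summarized as a small pure function. The following is a hedged userspace sketch, not kernel code: the enum and function names are invented for illustration, and the real patch sleeps via congestion_wait() and yields via cond_resched().

```c
#include <assert.h>

/* Illustrative model of the back-off policy in the diff above. */
enum backoff {
	BACKOFF_RETRY_NOW,	/* reclaim made progress: retry immediately */
	BACKOFF_SLEEP,		/* no progress, IO-bound: congestion_wait(HZ/10) */
	BACKOFF_YIELD		/* no progress, not IO-bound: cond_resched() */
};

static enum backoff pick_backoff(unsigned long did_some_progress,
				 unsigned long writeback,
				 unsigned long dirty,
				 unsigned long reclaimable)
{
	if (did_some_progress)
		return BACKOFF_RETRY_NOW;
	/*
	 * More than half of the reclaimable pages are dirty or under
	 * writeback: a no-progress reclaim is most likely waiting on
	 * stale IO, so sleeping beats declaring a premature OOM.
	 */
	if (2 * (writeback + dirty) > reclaimable)
		return BACKOFF_SLEEP;
	return BACKOFF_YIELD;
}
```

For example, with 100 reclaimable pages of which 40 are under writeback and 20 dirty, a no-progress reclaim would sleep; with only 20 dirty+writeback pages it would merely yield the CPU before retrying.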
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25 ` Silent hang up caused by pages being not scanned? Tetsuo Handa
  2015-10-12 21:23   ` Linus Torvalds
@ 2015-10-13 13:32   ` Michal Hocko
  2015-10-13 16:19     ` Tetsuo Handa
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-13 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue 13-10-15 00:25:53, Tetsuo Handa wrote:
[...]
> What is strange, the values printed by this debug printk() patch did not
> change as time went by. Thus, I think that this is not a problem of lack of
> CPU time for scanning pages. I suspect that there is a bug that nobody is
> scanning pages.
>
> ----------
> [   66.821450] zone_reclaimable returned 1 at line 2646
> [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [   66.824935] shrink_zones returned 1 at line 2706
> [   66.826392] zones_reclaimable=1 at line 2765
> [   66.827865] do_try_to_free_pages returned 1 at line 2938
> [   67.102322] __perform_reclaim returned 1 at line 2854
> [   67.103968] did_some_progress=1 at line 3301
> (...snipped...)
> [  281.439977] zone_reclaimable returned 1 at line 2646
> [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [  281.439978] shrink_zones returned 1 at line 2706
> [  281.439978] zones_reclaimable=1 at line 2765
> [  281.439979] do_try_to_free_pages returned 1 at line 2938
> [  281.439979] __perform_reclaim returned 1 at line 2854
> [  281.439980] did_some_progress=1 at line 3301

This is really interesting because even with reclaimable LRUs this low
we should eventually scan them enough times to convince
zone_reclaimable to fail. PAGES_SCANNED in your logs seems to be
constant, though, which suggests somebody manages to free a page every
time before we get down to priority 0 and manage to scan something
finally. This is pretty much pathological behavior and I have hard time
to imagine how would that be possible but it clearly shows that
zone_reclaimable heuristic is not working properly.

I can see two options here. Either we teach zone_reclaimable to be less
fragile or remove zone_reclaimable from shrink_zones altogether. Both
of them are risky because we have a long history of changes in this
areas which made other subtle behavior changes but I guess that the
first option should be less fragile. What about the following patch? I
am not happy about it because the condition is rather rough and a
deeper inspection is really needed to check all the call sites but it
should be good for testing.
---
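[Editor's illustration] The heuristic under discussion is easy to model outside the kernel. Below is a hedged userspace sketch (struct and function names are invented, not kernel API) of the zone_reclaimable() test: a zone counts as reclaimable while pages_scanned stays below six times the pages on its reclaimable LRU lists, which is why a PAGES_SCANNED that never accumulates past 6 * 36 in the log above keeps the reclaim loop alive.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model (illustrative only) of the zone_reclaimable()
 * heuristic: a zone is treated as reclaimable while the pages scanned
 * since the last freed page stay below six times the pages sitting on
 * its reclaimable LRU lists.
 */
struct zone_counters {
	unsigned long pages_scanned;
	unsigned long active_file, inactive_file;
	unsigned long active_anon, inactive_anon;
};

static bool zone_reclaimable_model(const struct zone_counters *z,
				   bool swap_available)
{
	unsigned long reclaimable = z->active_file + z->inactive_file;

	/* anon pages only count as reclaimable when they can be swapped */
	if (swap_available)
		reclaimable += z->active_anon + z->inactive_anon;
	return z->pages_scanned < reclaimable * 6;
}
```

With the numbers from the log above (ACTIVE_FILE=26, INACTIVE_FILE=10, PAGES_SCANNED=32, no swap), the model keeps returning true; it would only flip to false once pages_scanned accumulated to 216 or more, which the constant PAGES_SCANNED suggests never happens.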
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 13:32   ` Michal Hocko
@ 2015-10-13 16:19     ` Tetsuo Handa
  2015-10-14 13:22       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-13 16:19 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> I can see two options here. Either we teach zone_reclaimable to be less
> fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> them are risky because we have a long history of changes in this areas
> which made other subtle behavior changes but I guess that the first
> option should be less fragile. What about the following patch? I am not
> happy about it because the condition is rather rough and a deeper
> inspection is really needed to check all the call sites but it should be
> good for testing.

While zone_reclaimable() for Node 0 DMA32 became false by your patch,
zone_reclaimable() for Node 0 DMA kept returning true, and as a result
overall result (i.e. zones_reclaimable) remained true.

$ ./a.out
---------- When there is no data to write ----------
[  162.942371] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=16
[  162.944541] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  162.946560] zone_reclaimable returned 1 at line 2665
[  162.948722] shrink_zones returned 1 at line 2716
(...snipped...)
[  164.897587] zones_reclaimable=1 at line 2775
[  164.899172] do_try_to_free_pages returned 1 at line 2948
[  167.087119] __perform_reclaim returned 1 at line 2854
[  167.088868] did_some_progress=1 at line 3301
(...snipped...)
[  261.577944] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  261.580093] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  261.582333] zone_reclaimable returned 1 at line 2665
[  261.583841] shrink_zones returned 1 at line 2716
(...snipped...)
[  264.728434] zones_reclaimable=1 at line 2775
[  264.730002] do_try_to_free_pages returned 1 at line 2948
[  268.191368] __perform_reclaim returned 1 at line 2854
[  268.193113] did_some_progress=1 at line 3301
---------- When there is no data to write ----------

Complete log (with your patch inside) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151014.txt.xz .

By the way, the OOM killer seems to be invoked prematurely for
different load if your patch is applied.

$ cat < /dev/zero > /tmp/log & sleep 10; ./a.out
---------- When there is a lot of data to write ----------
[   69.019271] Mem-Info:
[   69.019755] active_anon:335006 inactive_anon:2084 isolated_anon:23
[   69.019755]  active_file:12197 inactive_file:65310 isolated_file:31
[   69.019755]  unevictable:0 dirty:533 writeback:51020 unstable:0
[   69.019755]  slab_reclaimable:4753 slab_unreclaimable:4134
[   69.019755]  mapped:9639 shmem:2144 pagetables:2030 bounce:0
[   69.019755]  free:12972 free_pcp:45 free_cma:0
[   69.026260] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5232kB inactive_anon:96kB active_file:424kB inactive_file:1068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:164kB writeback:972kB mapped:416kB shmem:104kB slab_reclaimable:304kB slab_unreclaimable:244kB kernel_stack:96kB pagetables:256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[   69.037189] lowmem_reserve[]: 0 1729 1729 1729
[   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   69.052017] lowmem_reserve[]: 0 0 0 0
[   69.053818] Node 0 DMA: 17*4kB (UME) 8*8kB (UME) 6*16kB (UME) 2*32kB (UM) 2*64kB (UE) 4*128kB (UME) 1*256kB (U) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7332kB
[   69.059597] Node 0 DMA32: 632*4kB (UME) 454*8kB (UME) 507*16kB (UME) 310*32kB (UME) 177*64kB (UE) 61*128kB (UME) 15*256kB (ME) 19*512kB (M) 10*1024kB (M) 0*2048kB 0*4096kB = 67136kB
[   69.065810] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   69.068305] 72477 total pagecache pages
[   69.069932] 0 pages in swap cache
[   69.071435] Swap cache stats: add 0, delete 0, find 0/0
[   69.073354] Free swap  = 0kB
[   69.074822] Total swap = 0kB
[   69.076660] 524157 pages RAM
[   69.078113] 0 pages HighMem/MovableOnly
[   69.079930] 76615 pages reserved
[   69.081406] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:19     ` Tetsuo Handa
@ 2015-10-14 13:22       ` Michal Hocko
  2015-10-14 14:38         ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-14 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 01:19:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I can see two options here. Either we teach zone_reclaimable to be less
> > fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> > them are risky because we have a long history of changes in this areas
> > which made other subtle behavior changes but I guess that the first
> > option should be less fragile. What about the following patch? I am not
> > happy about it because the condition is rather rough and a deeper
> > inspection is really needed to check all the call sites but it should be
> > good for testing.
>
> While zone_reclaimable() for Node 0 DMA32 became false by your patch,
> zone_reclaimable() for Node 0 DMA kept returning true, and as a result
> overall result (i.e. zones_reclaimable) remained true.

Ahh, right you are. ZONE_DMA might have 0 or close to 0 pages on LRUs
while it is still protected from allocations which are not targeted for
this zone. My patch clearly haven't considered that. The fix for that
would be quite straightforward. We have to consider lowmem_reserve of
the zone wrt. the allocation/reclaim gfp target zone. But this is
getting more and more ugly (see the patch below just for
testing/demonstration purposes).

The OOM report is really interesting:

> [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

so your whole file LRUs are either dirty or under writeback and
reclaimable pages are below min wmark. This alone is quite suspicious.
Why hasn't balance_dirty_pages throttled writers and allowed them to
make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
configuration on that system.

Also why throttle_vm_writeout haven't slown the reclaim down?

Anyway this is exactly the case where zone_reclaimable helps us to
prevent OOM because we are looping over the remaining LRU pages without
making progress... This just shows how subtle all this is :/

I have to think about this much more..
---
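[Editor's illustration] The lowmem_reserve protection Michal refers to can be sketched as a hedged userspace model (the function name is invented; the real check lives in the kernel's __zone_watermark_ok()): a low zone such as ZONE_DMA only serves an allocation targeted at a higher zone if its free pages exceed the min watermark plus the per-target-zone reserve, so a nearly empty ZONE_DMA can stay protected while its tiny LRUs keep zone_reclaimable() returning true.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the lowmem_reserve protection: an allocation
 * whose preferred zone is classzone_idx may only dip into this zone
 * when free pages exceed the min watermark plus the reserve configured
 * for that target zone.
 */
static bool watermark_ok_model(unsigned long free_pages,
			       unsigned long min_wmark,
			       const unsigned long *lowmem_reserve,
			       int classzone_idx)
{
	return free_pages > min_wmark + lowmem_reserve[classzone_idx];
}
```

For instance, with ZONE_DMA's lowmem_reserve[] of 0 1714 1714 1714 from the reports above, about 1800 free pages would pass the check for a DMA-targeted allocation (index 0) but fail for a higher-zone-targeted one (index 2), which is exactly the asymmetry a zone_reclaimable fix has to take into account.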
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 13:22       ` Michal Hocko
@ 2015-10-14 14:38         ` Tetsuo Handa
  2015-10-14 14:59           ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 14:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> The OOM report is really interesting:
>
> > [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>
> so your whole file LRUs are either dirty or under writeback and
> reclaimable pages are below min wmark. This alone is quite suspicious.

I did

  $ cat < /dev/zero > /tmp/log

for 10 seconds before starting

  $ ./a.out

Thus, so much memory was waiting for writeback on XFS filesystem.

> Why hasn't balance_dirty_pages throttled writers and allowed them to
> make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> configuration on that system.

All values are defaults of plain CentOS 7 installation.

# sysctl -a | grep ^vm.
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256	256	32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 45056
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 54808
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

>
> Also why throttle_vm_writeout haven't slown the reclaim down?

Too difficult question for me.

>
> Anyway this is exactly the case where zone_reclaimable helps us to
> prevent OOM because we are looping over the remaining LRU pages without
> making progress... This just shows how subtle all this is :/
>
> I have to think about this much more..

I'm suspicious about tweaking current reclaim logic.
Could you please respond to Linus's comments?

There are more moles than kernel developers can find. I think that
what we can do for short term is to prepare for moles that kernel
developers could not find, and for long term is to reform page
allocator for preventing moles from living.
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:38         ` Tetsuo Handa
@ 2015-10-14 14:59           ` Michal Hocko
  2015-10-14 15:06             ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-14 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > configuration on that system.
>
> All values are defaults of plain CentOS 7 installation.

So this is 3.10 kernel, right?

> # sysctl -a | grep ^vm.
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
[...]

OK, this is nothing unusual. And I _suspect_ that the throttling simply
didn't cope with the writer speed and a large anon memory consumer.
Dirtyable memory was quite high until your anon hammer bumped in and
reduced dirtyable memory down so the file LRU is full of dirty pages
when we get under serious memory pressure. Anonymous pages are not
reclaimable so the whole memory pressure goes to file LRUs and bang.

> > Also why throttle_vm_writeout haven't slown the reclaim down?
>
> Too difficult question for me.
>
> >
> > Anyway this is exactly the case where zone_reclaimable helps us to
> > prevent OOM because we are looping over the remaining LRU pages without
> > making progress... This just shows how subtle all this is :/
> >
> > I have to think about this much more..
>
> I'm suspicious about tweaking current reclaim logic.
> Could you please respond to Linus's comments?

Yes I plan to I just didn't get to finish my email yet.

> There are more moles than kernel developers can find. I think that
> what we can do for short term is to prepare for moles that kernel
> developers could not find, and for long term is to reform page
> allocator for preventing moles from living.

This is much easier said than done :/ The current code is full of
heuristics grown over time based on very different requirements from
different kernel subsystems. There is no simple solution for this
problem I am afraid.
--
Michal Hocko
SUSE Labs
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:59           ` Michal Hocko
@ 2015-10-14 15:06             ` Tetsuo Handa
  0 siblings, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 15:06 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > > configuration on that system.
> >
> > All values are defaults of plain CentOS 7 installation.
>
> So this is 3.10 kernel, right?

The userland is CentOS 7 but the kernel is linux-next-20151009.
end of thread, other threads:[~2015-10-19 12:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-14  8:03 Silent hang up caused by pages being not scanned? Hillf Danton
     -- strict thread matches above, loose matches on Subject: below --
2015-09-28 16:18 can't oom-kill zap the victim's memory? Tetsuo Handa
2015-10-02 12:36 ` Michal Hocko
2015-10-03  6:02   ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
2015-10-06 14:51     ` Tetsuo Handa
2015-10-12  6:43       ` Tetsuo Handa
2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
2015-10-12 21:23           ` Linus Torvalds
2015-10-13 12:21             ` Tetsuo Handa
2015-10-13 16:37               ` Linus Torvalds
2015-10-14 12:21                 ` Tetsuo Handa
2015-10-15 13:14                   ` Michal Hocko
2015-10-16 15:57                     ` Michal Hocko
2015-10-16 18:34                       ` Linus Torvalds
2015-10-16 18:49                         ` Tetsuo Handa
2015-10-19 12:57                           ` Michal Hocko
2015-10-19 12:53                         ` Michal Hocko
2015-10-13 13:32           ` Michal Hocko
2015-10-13 16:19             ` Tetsuo Handa
2015-10-14 13:22               ` Michal Hocko
2015-10-14 14:38                 ` Tetsuo Handa
2015-10-14 14:59                   ` Michal Hocko
2015-10-14 15:06                     ` Tetsuo Handa
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox