* Re: Silent hang up caused by pages being not scanned?
@ 2015-10-14 8:03 Hillf Danton
0 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2015-10-14 8:03 UTC (permalink / raw)
To: 'Tetsuo Handa'
Cc: Linus Torvalds, Michal Hocko, David Rientjes, Johannes Weiner,
linux-kernel, linux-mm
> >
> > In particular, I think that you'll find that you will have to change
> > the heuristics in __alloc_pages_slowpath() where we currently do
> >
> > if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
> >
> > when the "did_some_progress" logic changes that radically.
> >
>
> Yes. But we can't simply do
>
> if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?
>
Can you please try a simplified retry logic?
thanks
Hillf
--- a/mm/page_alloc.c	Wed Oct 14 14:45:28 2015
+++ b/mm/page_alloc.c	Wed Oct 14 15:43:31 2015
@@ -3154,8 +3154,7 @@ retry:
 	/* Keep reclaiming pages as long as there is reasonable progress */
 	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
+	if (did_some_progress) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: can't oom-kill zap the victim's memory?
@ 2015-09-28 16:18 Tetsuo Handa
2015-10-02 12:36 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-09-28 16:18 UTC (permalink / raw)
To: mhocko, rientjes
Cc: oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
linux-kernel, skozina

Michal Hocko wrote:
> The point I've tried to made is that oom unmapper running in a detached
> context (e.g. kernel thread) vs. directly in the oom context doesn't
> make any difference wrt. lock because the holders of the lock would loop
> inside the allocator anyway because we do not fail small allocations.

We tried to allow small allocations to fail. It resulted in an unstable
system with obscure bugs. We tried to allow small !__GFP_FS allocations to
fail. They failed to fail because they are effectively __GFP_NOFAIL
allocations. We are now trying to allow zapping the OOM victim's mm.
Michal is already skeptical about this approach due to lock dependency.

We have already spent 9 months on this OOM livelock. There is no silver
bullet yet. The proposed approaches are too drastic to backport for
existing users. I think we are out of bullets. Until we finish
adding/testing __GFP_NORETRY (or __GFP_KILLABLE) at most of the callsites,
a timeout-based workaround will be the only bullet we can use. Michal's
panic_on_oom_timeout and David's "global access to memory reserves" will be
acceptable for some users if these approaches are offered as opt-in.
Likewise, my memdie_task_skip_secs / memdie_task_panic_secs will be
acceptable for those who would rather retry a bit longer than panic on an
accidental livelock, if this approach is offered as opt-in.

Tetsuo Handa wrote:
> Excuse me, but thinking about the CLONE_VM without CLONE_THREAD case...
> Isn't there a possibility of hitting livelocks at
>
>	/*
>	 * If current has a pending SIGKILL or is exiting, then automatically
>	 * select it.  The goal is to allow it to allocate so that it may
>	 * quickly exit and free its memory.
>	 *
>	 * But don't select if current has already released its mm and cleared
>	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>	 */
>	if (current->mm &&
>	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
>		mark_oom_victim(current);
>		return true;
>	}
>
> if the current thread receives SIGKILL just before reaching here, since we
> don't send SIGKILL to all threads sharing the mm?

It seems that CLONE_VM without CLONE_THREAD is irrelevant here. We have
sequences like

  Do a GFP_KERNEL allocation.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release the lock.

where an example is seen in VFS operations which receive a pathname from
user space using getname() and then call VFS functions, after which
filesystem code takes locks which can contend with other threads.

------------------------------------------------------------
diff --git a/fs/namei.c b/fs/namei.c
index d68c21f..d51c333 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4005,6 +4005,8 @@ int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 	if (error)
 		return error;
 
+	if (fatal_signal_pending(current))
+		printk(KERN_INFO "Calling symlink with SIGKILL pending\n");
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	if (!error)
 		fsnotify_create(dir, dentry);
@@ -4021,6 +4023,10 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	struct path path;
 	unsigned int lookup_flags = 0;
 
+	if (!strcmp(current->comm, "a.out")) {
+		printk(KERN_INFO "Sending SIGKILL to current thread\n");
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+	}
 	from = getname(oldname);
 	if (IS_ERR(from))
 		return PTR_ERR(from);
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 996481e..2b6faa5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -240,6 +240,8 @@ xfs_symlink(
 	if (error)
 		goto out_trans_cancel;
 
+	if (fatal_signal_pending(current))
+		printk(KERN_INFO "Calling xfs_ilock() with SIGKILL pending\n");
 	xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
 		  XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
------------------------------------------------------------

[ 119.534976] Sending SIGKILL to current thread
[ 119.535898] Calling symlink with SIGKILL pending
[ 119.536870] Calling xfs_ilock() with SIGKILL pending

Any program can potentially hit this silent livelock. We can't predict
which locks the OOM victim's threads will depend on after TIF_MEMDIE was
set by the OOM killer. Therefore, I think that letting TIF_MEMDIE disable
the OOM killer indefinitely is one of the possible causes of the silent
hangup troubles.

Michal Hocko wrote:
> I really hate to do "easy" things now just to feel better about
> particular case which will kick us back little bit later. And from my
> own experience I can tell you that a more non-deterministic OOM behavior
> is thing people complain about.

I believe that not waiting for the TIF_MEMDIE thread indefinitely is the
first choice we can propose people to try. From my own experience I can
tell you that some customers are really sensitive about bugs which halt
their systems (e.g. https://access.redhat.com/solutions/68466 ). An opt-in
version of a TIF_MEMDIE timeout should be acceptable for people who prefer
avoiding a silent hangup over non-deterministic OOM behavior, once they are
told the truth about the current memory allocator's behavior.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: can't oom-kill zap the victim's memory?
2015-09-28 16:18 can't oom-kill zap the victim's memory? Tetsuo Handa
@ 2015-10-02 12:36 ` Michal Hocko
2015-10-03 6:02 ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-02 12:36 UTC (permalink / raw)
To: Tetsuo Handa
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The point I've tried to made is that oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. lock because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
>
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.

Have they been reported/fixed? All kernel paths doing an allocation are
_supposed_ to check and handle ENOMEM. If they are not then they are
buggy and should be fixed.

> We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> effectively __GFP_NOFAIL allocations.

What do you mean by that? An opencoded __GFP_NOFAIL?

> We are now trying to allow zapping OOM victim's mm. Michal is already
> skeptical about this approach due to lock dependency.

I am not sure where this came from. I am all for this approach. It will
not solve the problem completely for sure but it can help in many cases
already.

> We already spent 9 months on this OOM livelock. No silver bullet yet.
> Proposed approaches are too drastic to backport for existing users.
> I think we are out of bullet.

Not at all. We have had this problem basically forever, and we have a lot
of legacy issues to care about. But nobody could reasonably expect this
to be solved in a short time period.

> Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> of callsites,

This is simply not doable. There are thousands of allocation sites all
over the kernel.

> timeout based workaround will be the only bullet we can use.

Those are the last resort which only papers over real bugs which should
be fixed. I would agree with your urging if this were something that
could easily happen on a _properly_ configured system. A system which can
blow up into an OOM storm is far from being configured properly. If you
have untrusted users running on your system, you should put them into a
highly restricted environment and limit them as much as possible.

I can completely understand your frustration about the pace of the
progress here, but this is nothing new, and we should strive for a long
term vision which would be much less fragile than what we have right now.
No timeout-based solution is the way in that direction.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 18+ messages in thread
* Can't we use timeout based OOM warning/killing?
2015-10-02 12:36 ` Michal Hocko
@ 2015-10-03 6:02 ` Tetsuo Handa
2015-10-06 14:51 ` Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-03 6:02 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > The point I've tried to made is that oom unmapper running in a detached
> > > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > > make any difference wrt. lock because the holders of the lock would loop
> > > inside the allocator anyway because we do not fail small allocations.
> >
> > We tried to allow small allocations to fail. It resulted in unstable system
> > with obscure bugs.
>
> Have they been reported/fixed? All kernel paths doing an allocation are
> _supposed_ to check and handle ENOMEM. If they are not then they are
> buggy and should be fixed.

Kernel developers are not interested in testing OOM cases. I proposed a
SystemTap-based mandatory memory allocation failure injection for testing
OOM cases, but there was no response. Most of the memory allocation
failure paths in the kernel remain untested. Unless you persuade all
kernel developers to test OOM cases, add a gfp flag which bypasses the
memory allocation failure injection test (e.g. __GFP_FITv1_PASSED), and
make any !__GFP_FITv1_PASSED && !__GFP_NOFAIL allocation always fail, we
can't verify that "all kernel paths doing an allocation are _supposed_ to
check and handle ENOMEM".

> > We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> > effectively __GFP_NOFAIL allocations.
>
> What do you mean by that? An opencoded __GFP_NOFAIL?

Yes. An XFS livelock is an example I can trivially reproduce. Loss of
reliability of buffered write()s is another example.
[ 1721.405074] buffer_io_error: 36 callbacks suppressed
[ 1721.406263] Buffer I/O error on dev sda1, logical block 34652401, lost async page write
[ 1721.406996] Buffer I/O error on dev sda1, logical block 34650278, lost async page write
[ 1721.407125] Buffer I/O error on dev sda1, logical block 34652330, lost async page write
[ 1721.407197] Buffer I/O error on dev sda1, logical block 34653485, lost async page write
[ 1721.407203] Buffer I/O error on dev sda1, logical block 34652398, lost async page write
[ 1721.407232] Buffer I/O error on dev sda1, logical block 34650494, lost async page write
[ 1721.407356] Buffer I/O error on dev sda1, logical block 34652361, lost async page write
[ 1721.407386] Buffer I/O error on dev sda1, logical block 34653484, lost async page write
[ 1721.407481] Buffer I/O error on dev sda1, logical block 34652396, lost async page write
[ 1721.407504] Buffer I/O error on dev sda1, logical block 34650291, lost async page write
[ 1723.369963] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1723.810033] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1725.434057] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.448049] XFS: a.out(7810) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.470757] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.474061] XFS: a.out(7881) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1725.586610] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1726.026702] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1726.043988] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1727.682001] XFS: a.out(8122) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1727.688661] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
[ 1727.785214] XFS: a.out(8241) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1728.226640] XFS: a.out(7770) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1728.290648] XFS: a.out(7788) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[ 1729.930028] XFS: a.out(8171) possible memory allocation deadlock in kmem_alloc (mode:0x8250)

> > We are now trying to allow zapping OOM victim's mm. Michal is already
> > skeptical about this approach due to lock dependency.
>
> I am not sure where this came from. I am all for this approach. It will
> not solve the problem completely for sure but it can help in many cases
> already.

Sorry. This was my misunderstanding. But I still think that we need to be
prepared for cases where the approach of zapping the OOM victim's mm
fails.
( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

> > We already spent 9 months on this OOM livelock. No silver bullet yet.
> > Proposed approaches are too drastic to backport for existing users.
> > I think we are out of bullet.
>
> Not at all. We have this problem since ever basically. And we have a lot
> of legacy issues to care about. But nobody could reasonably expect this
> will be solved in a short time period.

What people generally imagine about the OOM killer is that it is invoked
when the system is out of memory. But we know that there are many possible
cases where the OOM killer messages are not printed. We did not make an
effort to break people free from the belief that the OOM killer is invoked
whenever the system is out of memory, nor did we make an effort to provide
people a means to warn about an OOM situation, after we recognized the
"too small to fail" memory-allocation rule
( https://lwn.net/Articles/627419/ ) 9 months ago.
> > Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> > of callsites,
>
> This is simply not doable. There are thousand of allocation sites all
> over the kernel.

But changing the default behavior (i.e. implicitly behaving like
__GFP_NORETRY inside the memory allocator unless __GFP_NOFAIL is passed)
is also not doable. You would need to ask for ACKs from thousands of
allocation sites all over the kernel, and that is not realistic.

An example. I proposed a patch which changes the default behavior in XFS
and got feedback ( http://marc.info/?l=linux-mm&m=144279862227010 ) that
fundamentally changing the allocation behavior of the filesystem requires
some indication of the testing and characterization of how the change has
impacted the low memory balance and performance of the filesystem. You
would need to ask for ACKs from all filesystem developers.

Another example. I don't like that permission checks for access requests
from user space start failing with ENOMEM when memory is tight. It is not
acceptable that access requests by critical processes are failed because
of an inconsequential process's memory consumption.
( https://www.mail-archive.com/tomoyo-users-en@lists.osdn.me/msg00008.html )
This problem is not limited to permission checks. If a process executes a
program using execve() and that process has passed the point of no return
in the execve() operation, any memory allocation failure before reaching
the point of handling ENOMEM errors (e.g. failing to load shared libraries
before calling the main() function of the new program) gets the process
killed. If the process were the global init process, the system would
panic(). Although we mean to simply enforce "all kernel paths doing an
allocation are _supposed_ to check and handle ENOMEM", we end up with a
period where a memory allocation failure in user space results in an
unrecoverable failure. We depend on /proc/$pid/oom_score_adj for
protecting critical processes from inconsequential processes.

I'm happy to give up a memory allocation upon SIGKILL, but I'm not happy
to give up upon ENOMEM without making an effort to solve the OOM
situation.

> > timeout based workaround will be the only bullet we can use.
>
> Those are the last resort which only paper over real bugs which should
> be fixed. I would agree with your urging if this was something that can
> easily happen on a _properly_ configured system. System which can blow
> into an OOM storm is far from being configured properly. If you have an
> untrusted users running on your system you should better put them into a
> highly restricted environment and limit as much as possible.

People are reporting hangup problems. I suspect that some of them are
caused by silent OOM. I showed you that there are many possible paths
which can lead to a silent hangup. But we are forcing people to use
kernels without a means to find out what was happening. Therefore, "there
is no report" does not mean that "we are not hitting OOM livelock
problems". Without a means to find out what was happening, we will
"overlook real bugs" before we "paper over real bugs". Such a means is
expected to work without knowledge of the trace points functionality, to
run without memory allocation, to dump output without an administrator's
operation, and to work before a power reset by watchdog timers.

> I can completely understand your frustration about the pace of the
> progress here but this is nothing new and we should strive for long term
> vision which would be much less fragile than what we have right now. No
> timeout based solution is the way in that direction.

Can we stop randomly setting TIF_MEMDIE on only one task and staying
silent forever in the hope that the task can make a quick exit? As long
as small allocations do not fail, this TIF_MEMDIE logic is prone to
livelock. We won't be able to change small allocations to fail (as Linus
said at
http://lkml.kernel.org/r/CA+55aFw=OLSdh-5Ut2vjy=4Yf1fTXqpzoDHdF7XnT5gDHs6sYA@mail.gmail.com
and as I said in this post) in the near future. Like I said at
http://lkml.kernel.org/r/201510012113.HEA98301.SVFQOFtFOHLMOJ@I-love.SAKURA.ne.jp ,
can't we start adding a means to emit some diagnostic kernel messages
automatically?

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
2015-10-03 6:02 ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
@ 2015-10-06 14:51 ` Tetsuo Handa
2015-10-12 6:43 ` Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-06 14:51 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Sorry. This was my misunderstanding. But I still think that we need to be
> prepared for cases where zapping OOM victim's mm approach fails.
> ( http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp )

I tested whether it is easy or difficult to make the approach of zapping
the OOM victim's mm fail. It turns out to be not difficult to make it
fail.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int reader(void *unused)
{
	char c;
	int fd = open("/proc/self/cmdline", O_RDONLY);
	while (pread(fd, &c, 1, 0) == 1);
	return 0;
}

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static void *ptr[10000];
	int i;
	sleep(2);
	while (1) {
		for (i = 0; i < 10000; i++)
			ptr[i] = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE,
				      fd, 0);
		for (i = 0; i < 10000; i++)
			munmap(ptr[i], 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int zero_fd = open("/dev/zero", O_RDONLY);
	char *buf = NULL;
	unsigned long size = 0;
	int i;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < 100; i++) {
		clone(reader, malloc(1024) + 1024,
		      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	}
	clone(writer, malloc(1024) + 1024,
	      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return * (char *) NULL; /* Kill all threads. */
}
---------- Reproducer end ----------

(I wrote this program to try to mimic a problem where a customer's system
hung up with a lot of ps processes blocked at reading /proc/pid/ entries
due to an unkillable down_read(&mm->mmap_sem) in __access_remote_vm().
Though I couldn't identify what function was holding the mmap_sem for
writing...)

Uptime > 429 of http://I-love.SAKURA.ne.jp/tmp/serial-20151006.txt.xz
showed an OOM livelock in which

  (1) the thread group leader is blocked at down_read(&mm->mmap_sem) in
      exit_mm() called from do_exit().

  (2) the writer thread is blocked at down_write(&mm->mmap_sem) in
      vm_mmap_pgoff() called from SyS_mmap_pgoff() called from SyS_mmap().

  (3) many reader threads are blocking the writer thread because of
      down_read(&mm->mmap_sem) called from proc_pid_cmdline_read().

  (4) while the thread group leader is blocked at down_read(&mm->mmap_sem),
      some of the reader threads are trying to allocate memory via page
      fault.

So, zapping the first OOM victim's mm might fail by chance.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Can't we use timeout based OOM warning/killing?
2015-10-06 14:51 ` Tetsuo Handa
@ 2015-10-12 6:43 ` Tetsuo Handa
2015-10-12 15:25 ` Silent hang up caused by pages being not scanned? Tetsuo Handa
0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-12 6:43 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> So, zapping the first OOM victim's mm might fail by chance.

I retested with a slightly different version.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	char buffer[128] = { };
	const pid_t pid = fork();
	if (pid == 0) {
		/* down_write(&mm->mmap_sem) requester which is chosen as
		 * an OOM victim. */
		int i;
		for (i = 0; i < 9; i++)
			clone(writer, malloc(1024) + 1024,
			      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
		writer(NULL);
	}
	snprintf(buffer, sizeof(buffer) - 1, "/proc/%u/stat", pid);
	if (fork() == 0) {
		/* down_read(&mm->mmap_sem) requester. */
		const int fd = open(buffer, O_RDONLY);
		while (pread(fd, buffer, sizeof(buffer), 0) > 0);
		_exit(0);
	} else {
		/* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long size = 0;
		const int fd = open("/dev/zero", O_RDONLY);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		read(fd, buf, size); /* Will cause OOM due to overcommit */
		return 0;
	}
}
---------- Reproducer end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151012.txt.xz .
Uptime between 101 and 300 is a silent hangup (i.e. no OOM killer
messages, no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I resolved
using SysRq-f at uptime = 289. I don't know the reason for this silent
hangup, but the memory unmapping kernel thread will not help because there
is no OOM victim.

----------
[ 101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 289.343187] sysrq: SysRq : Manual OOM execution
(...snipped...)
[ 292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
(...snipped...)
[ 302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
----------

Uptime between 379 and 605 is a mmap_sem livelock after the OOM killer was
invoked.
----------
[ 380.039897] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 380.042500] [ 467] 0 467 14047 1815 28 3 0 0 systemd-journal
[ 380.045055] [ 482] 0 482 10413 259 23 3 0 -1000 systemd-udevd
[ 380.047637] [ 504] 0 504 12795 119 25 3 0 -1000 auditd
[ 380.050127] [ 1244] 0 1244 82428 4257 81 3 0 0 firewalld
[ 380.052536] [ 1247] 70 1247 6988 61 21 3 0 0 avahi-daemon
[ 380.055028] [ 1250] 0 1250 54104 1372 42 4 0 0 rsyslogd
[ 380.057505] [ 1251] 0 1251 137547 2620 91 3 0 0 tuned
[ 380.059996] [ 1255] 0 1255 4823 77 15 3 0 0 irqbalance
[ 380.062552] [ 1256] 0 1256 1095 37 8 3 0 0 rngd
[ 380.065020] [ 1259] 0 1259 53626 441 60 3 0 0 abrtd
[ 380.067383] [ 1260] 0 1260 53001 341 58 5 0 0 abrt-watch-log
[ 380.069965] [ 1265] 0 1265 8673 83 21 3 0 0 systemd-logind
[ 380.072554] [ 1266] 81 1266 6663 117 18 3 0 -900 dbus-daemon
[ 380.075122] [ 1272] 0 1272 31577 154 21 3 0 0 crond
[ 380.077544] [ 1314] 70 1314 6988 57 19 3 0 0 avahi-daemon
[ 380.080013] [ 1427] 0 1427 46741 225 44 3 0 0 vmtoolsd
[ 380.082478] [ 1969] 0 1969 25942 3100 48 3 0 0 dhclient
[ 380.084969] [ 1990] 999 1990 128626 1929 50 4 0 0 polkitd
[ 380.087516] [ 2073] 0 2073 20629 214 45 3 0 -1000 sshd
[ 380.090065] [ 2201] 0 2201 7320 68 21 3 0 0 xinetd
[ 380.092465] [ 3215] 0 3215 22773 257 44 3 0 0 master
[ 380.094879] [ 3217] 89 3217 22816 249 45 3 0 0 qmgr
[ 380.097304] [ 3249] 0 3249 75245 315 97 3 0 0 nmbd
[ 380.099666] [ 3259] 0 3259 92963 486 131 5 0 0 smbd
[ 380.101956] [ 3282] 0 3282 27503 30 12 3 0 0 agetty
[ 380.104277] [ 3283] 0 3283 21788 154 49 3 0 0 login
[ 380.106574] [ 3286] 0 3286 92963 486 126 5 0 0 smbd
[ 380.108835] [ 3296] 1000 3296 28864 117 13 3 0 0 bash
[ 380.111073] [ 3374] 89 3374 22799 249 46 3 0 0 pickup
[ 380.113298] [ 3378] 89 3378 22836 252 45 3 0 0 cleanup
[ 380.115555] [ 3385] 89 3385 22800 248 44 3 0 0 trivial-rewrite
[ 380.117811] [ 3392] 0 3392 22825 265 48 3 0 0 local
[ 380.119995] [ 3393] 0 3393 30828 59 17 3 0 0 anacron
[ 380.122183] [ 3417] 1000 3417 541715 397587 787 6 0 0 a.out
[ 380.124315] [ 3418] 1000 3418 1081 24 8 3 0 0 a.out
[ 380.126410] [ 3419] 1000 3419 1042 21 7 3 0 0 a.out
[ 380.128535] Out of memory: Kill process 3417 (a.out) score 890 or sacrifice child
[ 380.130392] Killed process 3418 (a.out) total-vm:4324kB, anon-rss:96kB, file-rss:0kB
[ 392.704028] MemAlloc-Info: 7 stalling task, 10 dying task, 1 victim task.
(...snipped...)
[ 601.129977] a.out R running task 0 3417 3296 0x00000080
[ 601.131899] ffff8800774dba10 ffffffff8112b174 0000000000000100 0000000000000000
[ 601.134026] 0000000000000000 0000000000000000 00000000a23cb49d 0000000000000000
[ 601.136076] ffff880077603200 00000000024280ca 0000000000000000 ffff880077603200
[ 601.138090] Call Trace:
[ 601.139145] [<ffffffff8112b174>] ? try_to_free_pages+0x94/0xc0
[ 601.140831] [<ffffffff8111a8c4>] ? out_of_memory+0x2f4/0x460
[ 601.142489] [<ffffffff8111fa63>] ? __alloc_pages_nodemask+0x613/0xc30
[ 601.144328] [<ffffffff81161c40>] ? alloc_pages_vma+0xb0/0x200
[ 601.145994] [<ffffffff81143056>] ? handle_mm_fault+0xfa6/0x1370
[ 601.147677] [<ffffffff8162f557>] ? native_iret+0x7/0x7
[ 601.149258] [<ffffffff81058217>] ? __do_page_fault+0x177/0x400
[ 601.150966] [<ffffffff810584d0>] ? do_page_fault+0x30/0x80
[ 601.152625] [<ffffffff81630518>] ? page_fault+0x28/0x30
[ 601.154159] [<ffffffff813230c0>] ? __clear_user+0x20/0x50
[ 601.155723] [<ffffffff81327a68>] ? iov_iter_zero+0x68/0x250
[ 601.157329] [<ffffffff813fc6c8>] ? read_iter_zero+0x38/0xa0
[ 601.158923] [<ffffffff81187f04>] ? __vfs_read+0xc4/0xf0
[ 601.160453] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120
[ 601.161961] [<ffffffff811893a0>] ? SyS_read+0x50/0xc0
[ 601.163513] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.165254] a.out D ffff8800777b7e08 0 3418 3417 0x00100084
[ 601.167118] ffff8800777b7e08 ffff880077606400 ffff8800777b8000 ffff880036032e00
[ 601.169137] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff8800777b7e20
[ 601.171159] ffffffff8162a570 ffff880077606400 ffff8800777b7ea8 ffffffff8162d8eb
[ 601.173183] Call Trace:
[ 601.174193] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.175661] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[ 601.177388] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.179194] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[ 601.180971] [<ffffffff8162d05f>] ? down_write+0x1f/0x30
[ 601.182509] [<ffffffff81147abe>] vm_munmap+0x2e/0x60
[ 601.183992] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30
[ 601.185485] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 601.187224] a.out D ffff88007c60fdf0 0 3420 3417 0x00000084
[ 601.189130] ffff88007c60fdf0 ffff880078e15780 ffff88007c610000 ffff880036032de8
[ 601.191158] ffff880036032e00 ffff88007c60ff58 ffff880078e15780 ffff88007c60fe08
[ 601.193180] ffffffff8162a570 ffff880078e15780 ffff88007c60fe68 ffffffff8162d698
[ 601.195217] Call Trace:
[ 601.196226] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.197683] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.199407] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.201192] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.202711] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.204328] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.205874] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.207376] a.out D ffff88007c24fdf0 0 3421 3417 0x00000084
[ 601.209286] ffff88007c24fdf0 ffff880078e13200 ffff88007c250000 ffff880036032de8
[ 601.211316] ffff880036032e00 ffff88007c24ff58 ffff880078e13200 ffff88007c24fe08
[ 601.213335] ffffffff8162a570 ffff880078e13200 ffff88007c24fe68 ffffffff8162d698
[ 601.215356] Call Trace:
[ 601.216377] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.217831] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.219529] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.221296] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.222802] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.224403] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.225958] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.227453] a.out D ffff88007823bdf0 0 3422 3417 0x00000084
[ 601.229348] ffff88007823bdf0 ffff880078e10000 ffff88007823c000 ffff880036032de8
[ 601.231395] ffff880036032e00 ffff88007823bf58 ffff880078e10000 ffff88007823be08
[ 601.233427] ffffffff8162a570 ffff880078e10000 ffff88007823be68 ffffffff8162d698
[ 601.235472] Call Trace:
[ 601.236504] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.237989] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.239720] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.241583] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.243144] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.244777] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.246307] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.247823] a.out D ffff88007c483df0 0 3423 3417 0x00000084
[ 601.249719] ffff88007c483df0 ffff880078e13e80 ffff88007c484000 ffff880036032de8
[ 601.251765] ffff880036032e00 ffff88007c483f58 ffff880078e13e80 ffff88007c483e08
[ 601.253808] ffffffff8162a570 ffff880078e13e80 ffff88007c483e68 ffffffff8162d698
[ 601.255831] Call Trace:
[ 601.256850] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.258286] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150
[ 601.260005] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30
[ 601.261803] [<ffffffff8162d032>] ? down_read+0x12/0x20
[ 601.263329] [<ffffffff810583f7>] __do_page_fault+0x357/0x400
[ 601.264936] [<ffffffff810584d0>] do_page_fault+0x30/0x80
[ 601.266504] [<ffffffff81630518>] page_fault+0x28/0x30
[ 601.268019] a.out D ffff880035893e08 0 3424 3417 0x00000084
[ 601.269940] ffff880035893e08 ffff880078e17080 ffff880035894000 ffff880036032e00
[ 601.271945] ffff880036032de8 ffffffff00000000 ffffffff00000001 ffff880035893e20
[ 601.273954] ffffffff8162a570 ffff880078e17080 ffff880035893ea8 ffffffff8162d8eb
[ 601.276000] Call Trace:
[ 601.277007] [<ffffffff8162a570>] schedule+0x30/0x80
[ 601.278497] [<ffffffff8162d8eb>] rwsem_down_write_failed+0x1fb/0x350
[ 601.280240] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30
[ 601.282058] [<ffffffff81322f93>] call_rwsem_down_write_failed+0x13/0x20
[ 601.283872] [<ffffffff8162d05f>] ?
down_write+0x1f/0x30 [ 601.285403] [<ffffffff81147abe>] vm_munmap+0x2e/0x60 [ 601.286924] [<ffffffff811489fd>] SyS_munmap+0x1d/0x30 [ 601.288435] [<ffffffff8162e9ee>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 601.290184] a.out D ffff8800353b7df0 0 3425 3417 0x00000084 [ 601.292108] ffff8800353b7df0 ffff880078e10c80 ffff8800353b8000 ffff880036032de8 [ 601.294165] ffff880036032e00 ffff8800353b7f58 ffff880078e10c80 ffff8800353b7e08 [ 601.296206] ffffffff8162a570 ffff880078e10c80 ffff8800353b7e68 ffffffff8162d698 [ 601.298267] Call Trace: [ 601.299300] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.300755] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.302437] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.304221] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.305764] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.307389] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.308968] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.310488] a.out D ffff88007cf87df0 0 3426 3417 0x00000084 [ 601.312380] ffff88007cf87df0 ffff880078e16400 ffff88007cf88000 ffff880036032de8 [ 601.314414] ffff880036032e00 ffff88007cf87f58 ffff880078e16400 ffff88007cf87e08 [ 601.316443] ffffffff8162a570 ffff880078e16400 ffff88007cf87e68 ffffffff8162d698 [ 601.318490] Call Trace: [ 601.319536] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.321036] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.322763] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.324504] [<ffffffff8162d032>] ? 
down_read+0x12/0x20 [ 601.326071] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.327715] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.329287] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.330761] a.out D ffff8800792dfdf0 0 3427 3417 0x00000084 [ 601.332705] ffff8800792dfdf0 ffff880078e12580 ffff8800792e0000 ffff880036032de8 [ 601.334699] ffff880036032e00 ffff8800792dff58 ffff880078e12580 ffff8800792dfe08 [ 601.336750] ffffffff8162a570 ffff880078e12580 ffff8800792dfe68 ffffffff8162d698 [ 601.338794] Call Trace: [ 601.339781] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.341280] [<ffffffff8162d698>] rwsem_down_read_failed+0xf8/0x150 [ 601.343009] [<ffffffff81322f64>] call_rwsem_down_read_failed+0x14/0x30 [ 601.344813] [<ffffffff8162d032>] ? down_read+0x12/0x20 [ 601.346361] [<ffffffff810583f7>] __do_page_fault+0x357/0x400 [ 601.347990] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.349521] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.351044] a.out D ffff88007743faa8 0 3428 3417 0x00000084 [ 601.352942] ffff88007743faa8 ffff88007bda6400 ffff880077440000 ffff88007743fae0 [ 601.354990] ffff88007fccdfc0 00000001000484e5 0000000000000000 ffff88007743fac0 [ 601.357024] ffffffff8162a570 ffff88007fccdfc0 ffff88007743fb40 ffffffff8162dbed [ 601.359075] Call Trace: [ 601.360096] [<ffffffff8162a570>] schedule+0x30/0x80 [ 601.361540] [<ffffffff8162dbed>] schedule_timeout+0x11d/0x1c0 [ 601.363190] [<ffffffff810c7e00>] ? 
cascade+0x90/0x90 [ 601.364697] [<ffffffff8162dce9>] schedule_timeout_uninterruptible+0x19/0x20 [ 601.366574] [<ffffffff8111fc9d>] __alloc_pages_nodemask+0x84d/0xc30 [ 601.368332] [<ffffffff811609a7>] alloc_pages_current+0x87/0x110 [ 601.370002] [<ffffffff811166cf>] __page_cache_alloc+0xaf/0xc0 [ 601.371606] [<ffffffff81119225>] filemap_fault+0x1e5/0x420 [ 601.373203] [<ffffffff81244f39>] xfs_filemap_fault+0x39/0x60 [ 601.374798] [<ffffffff8113d5e7>] __do_fault+0x47/0xd0 [ 601.376315] [<ffffffff81142ec5>] handle_mm_fault+0xe15/0x1370 [ 601.377938] [<ffffffff81322f64>] ? call_rwsem_down_read_failed+0x14/0x30 [ 601.379707] [<ffffffff81058217>] __do_page_fault+0x177/0x400 [ 601.381320] [<ffffffff810584d0>] do_page_fault+0x30/0x80 [ 601.382831] [<ffffffff81630518>] page_fault+0x28/0x30 [ 601.384337] a.out R running task 0 3419 3417 0x00000080 [ 601.386257] 00000000f80745e8 ffff880034ab4400 ffff8800776d3f18 ffff8800776d3f18 [ 601.388287] 0000000000000080 0000000000000000 ffff8800776d3ec8 ffffffff81187e72 [ 601.390341] ffff880034ab4400 ffff880034ab4410 0000000000020000 0000000000000000 [ 601.392366] Call Trace: [ 601.393388] [<ffffffff81187e72>] ? __vfs_read+0x32/0xf0 [ 601.394952] [<ffffffff81290aa9>] ? security_file_permission+0xa9/0xc0 [ 601.396745] [<ffffffff8118858d>] ? rw_verify_area+0x4d/0xd0 [ 601.398359] [<ffffffff8118868a>] ? vfs_read+0x7a/0x120 [ 601.399897] [<ffffffff81189560>] ? SyS_pread64+0x90/0xb0 [ 601.401429] [<ffffffff8162e9ee>] ? entry_SYSCALL_64_fastpath+0x12/0x71 ---------- I think that I noticed three problems from this reproducer. (1) While the likeliness of hitting mmap_sem livelock would depend on how frequently down_read(&mm->mmap_sem) tasks and down_write(&mm->mmap_sem) tasks contend on the OOM victim's mm, we can hit mmap_sem livelock with even only one down_read(&mm->mmap_sem) task. On systems where processes are monitored using /proc/pid/ interface, we can by chance hit this mmap_sem livelock. 
(2) The OOM killer tries to kill a child process of the memory hog, but the
    child process is not always consuming much memory. The memory-unzapping
    kernel thread might not be able to reclaim enough memory unless we
    choose subsequent OOM victims when the first OOM victim task gets stuck
    in the mmap_sem livelock.

(3) I don't know the reason, but I can observe that (when many tasks have
    received SIGKILL from the OOM killer) many of the dying tasks keep
    competing for memory via page_fault(), and they cannot make forward
    progress because dying tasks without TIF_MEMDIE are not allowed to
    access the memory reserves.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Silent hang up caused by pages being not scanned?
  2015-10-12  6:43       ` Tetsuo Handa
@ 2015-10-12 15:25         ` Tetsuo Handa
  2015-10-12 21:23           ` Linus Torvalds
  2015-10-13 13:32           ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-12 15:25 UTC (permalink / raw)
To: mhocko
Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
    linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Uptime between 101 and 300 is a silent hang up (i.e. no OOM killer messages,
> no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I solved using SysRq-f
> at uptime = 289. I don't know the reason of this silent hang up, but the
> memory unzapping kernel thread will not help because there is no OOM victim.
>
> ----------
> [  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  289.343187] sysrq: SysRq : Manual OOM execution
> (...snipped...)
> [  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
> (...snipped...)
> [  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
> ----------

I examined this hang up using an additional debug printk() patch, and
observed that when this silent hang up occurs, zone_reclaimable() (called
from shrink_zones(), called from a __GFP_FS memory allocation request) is
returning true forever. Since the __GFP_FS memory allocation request can
never call out_of_memory() due to did_some_progress > 0, the system will
silently hang up with 100% CPU usage.
----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0473eec..fda0bb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,6 +2821,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 #endif /* CONFIG_COMPACTION */
 
+pid_t dump_target_pid;
+
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -2847,6 +2849,9 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "__perform_reclaim returned %u at line %u\n",
+		       progress, __LINE__);
 	return progress;
 }
 
@@ -3007,6 +3012,7 @@ static int malloc_watchdog(void *unused)
 	unsigned int memdie_pending;
 	unsigned int stalling_tasks;
 	u8 index;
+	pid_t pid;
 
  not_stalling: /* Healty case. */
 	/*
@@ -3025,12 +3031,16 @@ static int malloc_watchdog(void *unused)
 	 * and stop_memalloc_timer() within timeout duration.
 	 */
 	if (likely(!memalloc_counter[index]))
+	{
+		dump_target_pid = 0;
 		goto not_stalling;
+	}
 
  maybe_stalling: /* Maybe something is wrong. Let's check. */
 	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
 	sigkill_pending = 0;
 	memdie_pending = 0;
 	stalling_tasks = 0;
+	pid = 0;
 	preempt_disable();
 	rcu_read_lock();
 	for_each_process_thread(g, p) {
@@ -3062,8 +3072,11 @@ static int malloc_watchdog(void *unused)
 			       (fatal_signal_pending(p) ? "-dying" : ""),
 			       p->comm, p->pid, m->gfp, m->order, spent);
 			show_stack(p, NULL);
+			if (!pid && (m->gfp & __GFP_FS))
+				pid = p->pid;
 		}
 	spin_unlock(&memalloc_list_lock);
+	dump_target_pid = -pid;
 	/* Wait until next timeout duration. */
 	schedule_timeout_interruptible(timeout);
 	if (memalloc_counter[index])
@@ -3155,6 +3168,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 retry:
+	if (dump_target_pid == -current->pid)
+		dump_target_pid = -dump_target_pid;
+
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
@@ -3280,6 +3296,11 @@ retry:
 		goto noretry;
 
 	/* Keep reclaiming pages as long as there is reasonable progress */
+	if (dump_target_pid == current->pid) {
+		printk(KERN_INFO "did_some_progress=%lu at line %u\n",
+		       did_some_progress, __LINE__);
+		dump_target_pid = 0;
+	}
 	pages_reclaimed += did_some_progress;
 	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27d580b..cb0c22e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2527,6 +2527,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	return watermark_ok;
 }
 
+extern pid_t dump_target_pid;
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2619,16 +2621,41 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
 			if (nr_soft_reclaimed)
+			{
+				if (dump_target_pid == current->pid)
+					printk(KERN_INFO "nr_soft_reclaimed=%lu at line %u\n",
+					       nr_soft_reclaimed, __LINE__);
 				reclaimable = true;
+			}
 			/* need some check for avoid more shrink_zone() */
 		}
 
 		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
+		{
+			if (dump_target_pid == current->pid)
+				printk(KERN_INFO "shrink_zone returned 1 at line %u\n",
+				       __LINE__);
 			reclaimable = true;
+		}
 
 		if (global_reclaim(sc) &&
 		    !reclaimable && zone_reclaimable(zone))
+		{
+			if (dump_target_pid == current->pid) {
+				printk(KERN_INFO "zone_reclaimable returned 1 at line %u\n",
+				       __LINE__);
+				printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+				       zone_page_state(zone, NR_ACTIVE_FILE),
+				       zone_page_state(zone, NR_INACTIVE_FILE));
+				if (get_nr_swap_pages() > 0)
+					printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+					       zone_page_state(zone, NR_ACTIVE_ANON),
+					       zone_page_state(zone, NR_INACTIVE_ANON));
+				printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+				       zone_page_state(zone, NR_PAGES_SCANNED));
+			}
 			reclaimable = true;
+		}
 	}
 
 	/*
@@ -2674,6 +2701,9 @@ retry:
 						sc->priority);
 		sc->nr_scanned = 0;
 		zones_reclaimable = shrink_zones(zonelist, sc);
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "shrink_zones returned %u at line %u\n",
+			       zones_reclaimable, __LINE__);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2707,11 +2737,21 @@ retry:
 	delayacct_freepages_end();
 
 	if (sc->nr_reclaimed)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->nr_reclaimed=%lu at line %u\n",
+			       sc->nr_reclaimed, __LINE__);
 		return sc->nr_reclaimed;
+	}
 
 	/* Aborted reclaim to try compaction? don't OOM, then */
 	if (sc->compaction_ready)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->compaction_ready=%u at line %u\n",
+			       sc->compaction_ready, __LINE__);
 		return 1;
+	}
 
 	/* Untapped cgroup reserves? Don't OOM, retry. */
 	if (!sc->may_thrash) {
@@ -2720,6 +2760,9 @@ retry:
 		goto retry;
 	}
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "zones_reclaimable=%u at line %u\n",
+		       zones_reclaimable, __LINE__);
 	/* Any of the zones still reclaimable? Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
@@ -2875,7 +2918,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * point.
 	 */
 	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "throttle_direct_reclaim returned 1 at line %u\n",
+			       __LINE__);
 		return 1;
+	}
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
@@ -2885,6 +2933,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "do_try_to_free_pages returned %lu at line %u\n",
+		       nr_reclaimed, __LINE__);
 	return nr_reclaimed;
 }
----------

What is strange is that the values printed by this debug printk() patch did
not change as time went by. Thus, I think that this is not a problem of
lack of CPU time for scanning pages. I suspect that there is a bug where
nobody is scanning pages.

----------
[   66.821450] zone_reclaimable returned 1 at line 2646
[   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[   66.824935] shrink_zones returned 1 at line 2706
[   66.826392] zones_reclaimable=1 at line 2765
[   66.827865] do_try_to_free_pages returned 1 at line 2938
[   67.102322] __perform_reclaim returned 1 at line 2854
[   67.103968] did_some_progress=1 at line 3301
(...snipped...)
[  281.439977] zone_reclaimable returned 1 at line 2646
[  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[  281.439978] shrink_zones returned 1 at line 2706
[  281.439978] zones_reclaimable=1 at line 2765
[  281.439979] do_try_to_free_pages returned 1 at line 2938
[  281.439979] __perform_reclaim returned 1 at line 2854
[  281.439980] did_some_progress=1 at line 3301
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151013.txt.xz
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
@ 2015-10-12 21:23           ` Linus Torvalds
  2015-10-13 12:21             ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-10-12 21:23 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
    Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
    linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I examined this hang up using additional debug printk() patch. And it was
> observed that when this silent hang up occurs, zone_reclaimable() called from
> shrink_zones() called from a __GFP_FS memory allocation request is returning
> true forever. Since the __GFP_FS memory allocation request can never call
> out_of_memory() due to did_some_progress > 0, the system will silently hang up
> with 100% CPU usage.

I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.

So the do_try_to_free_pages() logic that does that

	/* Any of the zones still reclaimable? Don't OOM. */
	if (zones_reclaimable)
		return 1;

is rather dubious. The history of that odd line is pretty dubious too:
it used to be that we would return success if "shrink_zones()"
succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
logic got rewritten, and I don't think the current situation is all
that sane.

And returning 1 there is actively misleading to callers, since it
makes them think that it made progress.

So I think you should look at what happens if you just remove that
illogical and misleading return value.

HOWEVER.

I think that it's very true that we have then tuned all our *other*
heuristics for taking this thing into account, so I suspect that we'll
find that we'll need to tweak other places.
But this crazy "let's say that we made progress even when we didn't"
thing looks just wrong.

In particular, I think that you'll find that you will have to change
the heuristics in __alloc_pages_slowpath() where we currently do

	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..

when the "did_some_progress" logic changes that radically.

Because while the current return value looks insane, all the other
testing and tweaking has been done with that very odd return value in
place.

		Linus
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 21:23           ` Linus Torvalds
@ 2015-10-13 12:21             ` Tetsuo Handa
  2015-10-13 16:37               ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-13 12:21 UTC (permalink / raw)
To: torvalds
Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
    linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I examined this hang up using additional debug printk() patch. And it was
> > observed that when this silent hang up occurs, zone_reclaimable() called from
> > shrink_zones() called from a __GFP_FS memory allocation request is returning
> > true forever. Since the __GFP_FS memory allocation request can never call
> > out_of_memory() due to did_some_progress > 0, the system will silently hang up
> > with 100% CPU usage.
>
> I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.
>

I compared "hang up after the OOM killer is invoked" and "hang up before
the OOM killer is invoked" by always printing the values.

 			}
 			reclaimable = true;
 		}
+		else if (dump_target_pid == current->pid) {
+			printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+			       zone_page_state(zone, NR_ACTIVE_FILE),
+			       zone_page_state(zone, NR_INACTIVE_FILE));
+			if (get_nr_swap_pages() > 0)
+				printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+				       zone_page_state(zone, NR_ACTIVE_ANON),
+				       zone_page_state(zone, NR_INACTIVE_ANON));
+			printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+			       zone_page_state(zone, NR_PAGES_SCANNED));
+		}
 	}
 
 	/*

For the former case, most of the trials showed
(ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0 . Sometimes
PAGES_SCANNED > 0 (as grep'ed below), but ACTIVE_FILE and INACTIVE_FILE
seem to be always 0.
----------
[  195.905057] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  195.927430] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.317088] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.338007] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.723776] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.744618] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.129653] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.151238] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.650232] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.671343] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  277.980310] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  278.001481] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.339220] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.361908] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.682988] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.704055] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  350.368952] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  350.389770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.724821] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.746100] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.231887] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.233770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.253196] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.254910] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[ 1397.628073] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1397.649165] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.207041] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.228762] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
----------

For the latter case, most of the output showed that
ACTIVE_FILE + INACTIVE_FILE > 0.

----------
[  142.647201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.648883] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  142.842868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.955817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.086363] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.231120] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.359238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.473342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.618103] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.746210] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.908162] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.035415] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.161926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.306435] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.434265] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.436099] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  144.643374] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.773239] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.902309] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.046154] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.185410] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.317218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.460304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.654212] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.817362] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.945136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.086303] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.242127] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.489868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.491593] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  153.674246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.839478] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.003234] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.155085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.322187] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.447355] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.653150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.782216] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.939439] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.105921] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.278386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.440832] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.623970] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.625766] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.831074] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.996903] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.139137] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.318492] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.484300] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.667411] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.817246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.012323] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.159483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.323193] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.488399] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.654198] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.339172] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.340896] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.583026] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.797386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.965110] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.124935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.431304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.700317] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.862071] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.029257] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.198312] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.356224] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.559302] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.684486] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.898551] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.900496] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.175960] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.324390] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.526150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.693365] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.878407] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.061503] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.225306] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.416398] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.617395] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.783201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.989053] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  169.196126] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.361136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.362865] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.626817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.797361] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.006389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.211479] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.433890] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.630951] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.855509] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.049814] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.258218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.455404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.665085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.874173] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.057217] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.059056] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.350935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.559404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.782483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.982803] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.203930] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.428321] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.611349] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.851164] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.034220] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.279197] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.455284] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.811445] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.368405] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.370115] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.614733] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.845695] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.024274] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.211389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.427147] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.552333] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.734117] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.935811] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.138296] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.354041] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.559245] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.641776] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.716434] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.718199] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.015952] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.218976] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.440131] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.659238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.882360] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.087342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.314442] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.408926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.631240] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.850326] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.067488] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.283243] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
----------

So, something is preventing ACTIVE_FILE and INACTIVE_FILE from becoming 0?

I also tried the change below, but the result was the same. Therefore, this
problem seems to be independent of "!__GFP_FS allocations do not fail".
(A complete log with the change below applied (uptime > 101) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151013-2.txt.xz . )
----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2736,7 +2736,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		 * and the OOM killer can't be invoked, but
 		 * keep looping as per tradition.
 		 */
-		*did_some_progress = 1;
 		goto out;
 	}
 	if (pm_suspended_storage())
----------
----------
[ 102.719555] (ACTIVE_FILE=3+INACTIVE_FILE=3) * 6 > PAGES_SCANNED=19
[ 102.721234] (ACTIVE_FILE=1+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[ 102.722908] shrink_zones returned 1 at line 2717
----------

> So the do_try_to_free_pages() logic that does that
>
>         /* Any of the zones still reclaimable? Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> is rather dubious. The history of that odd line is pretty dubious too:
> it used to be that we would return success if "shrink_zones()"
> succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
> logic got rewritten, and I don't think the current situation is all
> that sane.
>
> And returning 1 there is actively misleading to callers, since it
> makes them think that it made progress.
>
> So I think you should look at what happens if you just remove that
> illogical and misleading return value.
>

If I remove

	/* Any of the zones still reclaimable? Don't OOM. */
	if (zones_reclaimable)
		return 1;

the OOM killer is invoked even when there is so much memory which can be
reclaimed after being written to disk. This is definitely a premature
invocation of the OOM killer.
$ cat < /dev/zero > /tmp/log & sleep 10; ./a.out
---------- When there is a lot of data to write ----------
[ 489.952827] Mem-Info:
[ 489.953840] active_anon:328227 inactive_anon:3033 isolated_anon:26
[ 489.953840]  active_file:2309 inactive_file:80915 isolated_file:0
[ 489.953840]  unevictable:0 dirty:53 writeback:80874 unstable:0
[ 489.953840]  slab_reclaimable:4975 slab_unreclaimable:4256
[ 489.953840]  mapped:2973 shmem:4192 pagetables:1939 bounce:0
[ 489.953840]  free:12963 free_pcp:60 free_cma:0
[ 489.963395] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5728kB inactive_anon:88kB active_file:140kB inactive_file:1276kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:1300kB mapped:140kB shmem:160kB slab_reclaimable:256kB slab_unreclaimable:180kB kernel_stack:64kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9768 all_unreclaimable? yes
[ 489.974035] lowmem_reserve[]: 0 1729 1729 1729
[ 489.975813] Node 0 DMA32 free:44552kB min:44652kB low:55812kB high:66976kB active_anon:1307180kB inactive_anon:12044kB active_file:9096kB inactive_file:322384kB unevictable:0kB isolated(anon):104kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:216kB writeback:322196kB mapped:11752kB shmem:16608kB slab_reclaimable:19644kB slab_unreclaimable:16844kB kernel_stack:3584kB pagetables:7576kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:2419896 all_unreclaimable? yes
[ 489.988452] lowmem_reserve[]: 0 0 0 0
[ 489.990043] Node 0 DMA: 2*4kB (UE) 1*8kB (M) 4*16kB (UME) 1*32kB (E) 2*64kB (UE) 3*128kB (UME) 2*256kB (UM) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7280kB
[ 489.995142] Node 0 DMA32: 578*4kB (UME) 726*8kB (UE) 447*16kB (UE) 253*32kB (UME) 155*64kB (UME) 42*128kB (UME) 3*256kB (UME) 2*512kB (UM) 4*1024kB (U) 0*2048kB 0*4096kB = 44552kB
[ 490.000511] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 490.002914] 87434 total pagecache pages
[ 490.004612] 0 pages in swap cache
[ 490.006138] Swap cache stats: add 0, delete 0, find 0/0
[ 490.007976] Free swap  = 0kB
[ 490.009329] Total swap = 0kB
[ 490.011033] 524157 pages RAM
[ 490.012352] 0 pages HighMem/MovableOnly
[ 490.013903] 76615 pages reserved
[ 490.015260] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------

$ ./a.out
---------- When there is no data to write ----------
[ 792.359024] Mem-Info:
[ 792.360001] active_anon:413751 inactive_anon:6226 isolated_anon:0
[ 792.360001]  active_file:0 inactive_file:0 isolated_file:0
[ 792.360001]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 792.360001]  slab_reclaimable:1243 slab_unreclaimable:3638
[ 792.360001]  mapped:104 shmem:6236 pagetables:1033 bounce:0
[ 792.360001]  free:12965 free_pcp:126 free_cma:0
[ 792.368559] Node 0 DMA free:7292kB min:400kB low:500kB high:600kB active_anon:7040kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:160kB slab_reclaimable:24kB slab_unreclaimable:172kB kernel_stack:64kB pagetables:460kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 792.378240] lowmem_reserve[]: 0 1729 1729 1729
[ 792.379834] Node 0 DMA32 free:44568kB min:44652kB low:55812kB high:66976kB active_anon:1647964kB inactive_anon:24744kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:0kB writeback:0kB mapped:416kB shmem:24784kB slab_reclaimable:4948kB slab_unreclaimable:14380kB kernel_stack:3104kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:504kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[ 792.390085] lowmem_reserve[]: 0 0 0 0
[ 792.391643] Node 0 DMA: 3*4kB (UE) 0*8kB 3*16kB (UE) 24*32kB (ME) 11*64kB (UME) 5*128kB (UM) 2*256kB (ME) 3*512kB (ME) 1*1024kB (E) 1*2048kB (E) 0*4096kB = 7292kB
[ 792.396201] Node 0 DMA32: 242*4kB (UME) 386*8kB (UME) 397*16kB (UME) 199*32kB (UE) 105*64kB (UME) 37*128kB (UME) 24*256kB (UME) 20*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 44616kB
[ 792.401136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 792.403356] 6250 total pagecache pages
[ 792.404803] 0 pages in swap cache
[ 792.406208] Swap cache stats: add 0, delete 0, find 0/0
[ 792.407896] Free swap  = 0kB
[ 792.409172] Total swap = 0kB
[ 792.410460] 524157 pages RAM
[ 792.411752] 0 pages HighMem/MovableOnly
[ 792.413106] 76615 pages reserved
[ 792.414493] 0 pages hwpoisoned
---------- When there is no data to write ----------

> HOWEVER.
>
> I think that it's very true that we have then tuned all our *other*
> heuristics for taking this thing into account, so I suspect that we'll
> find that we'll need to tweak other places. But this crazy "let's say
> that we made progress even when we didn't" thing looks just wrong.
>
> In particular, I think that you'll find that you will have to change
> the heuristics in __alloc_pages_slowpath() where we currently do
>
>    if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
>
> when the "did_some_progress" logic changes that radically.
>

Yes. But we can't simply do

	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..

because we won't be able to call out_of_memory(), can we?

> Because while the current return value looks insane, all the other
> testing and tweaking has been done with that very odd return value in
> place.
>
>               Linus
>

Well, did I encounter a difficult-to-fix problem?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 12:21 ` Tetsuo Handa
@ 2015-10-13 16:37 ` Linus Torvalds
  2015-10-14 12:21   ` Tetsuo Handa
  2015-10-15 13:14   ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-10-13 16:37 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina

On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> If I remove
>
>         /* Any of the zones still reclaimable? Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> the OOM killer is invoked even when there is so much memory which can be
> reclaimed after being written to disk. This is definitely a premature
> invocation of the OOM killer.

Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".

> Yes. But we can't simply do
>
>         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?

So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.

What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.

And looping is stupid, because we may not be able to actually free
anything exactly because of limitations like __GFP_FS.

So

 (a) the looping condition is senseless

 (b) what we do when looping is senseless

and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.

So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.
But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heuristics to this
whole loop".

So instead of that senseless thing, how about trying something
*sensible*. Make the code do something that we can actually explain as
making sense.

I'd suggest something like:

 - add a "retry count"

 - if direct reclaim made no progress, or made less progress than the target:

       if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

 - regardless of whether we made progress or not:

       if (retry count < X) goto retry;

       if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
       goto retry

where "X" is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).

The whole time-based thing might even be explicit. There's nothing
wrong with doing something like

        unsigned long timeout = jiffies + HZ/4;

at the top of the function, and making the whole retry logic actually
say something like

        if (time_after(timeout, jiffies)) goto noretry;

(or make *that* trigger the oom logic, or whatever).

Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?

             Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37 ` Linus Torvalds
@ 2015-10-14 12:21 ` Tetsuo Handa
  2015-10-15 13:14   ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 12:21 UTC (permalink / raw)
To: torvalds
Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > If I remove
> >
> >         /* Any of the zones still reclaimable? Don't OOM. */
> >         if (zones_reclaimable)
> >                 return 1;
> >
> > the OOM killer is invoked even when there are so much memory which can be
> > reclaimed after written to disk. This is definitely premature invocation of
> > the OOM killer.
>
> Right. The rest of the code knows that the return value right now
> means "there is no memory at all" rather than "I made progress".
>
> > Yes. But we can't simply do
> >
> >         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
> >
> > because we won't be able to call out_of_memory(), can we?
>
> So I think that whole thing is kind of senseless. Not just that
> particular conditional, but what it *does* too.
>
> What can easily happen is that we are a blocking allocation, but
> because we're __GFP_FS or something, the code doesn't actually start
> writing anything out. Nor is anything congested. So the thing just
> loops.

congestion_wait() sounds like a source of silent hang up.
http://lkml.kernel.org/r/201406052145.CIB35534.OQLVMSJFOHtFOF@I-love.SAKURA.ne.jp

>
> And looping is stupid, because we may be not able to actually free
> anything exactly because of limitations like __GFP_FS.
>
> So
>
>  (a) the looping condition is senseless
>
>  (b) what we do when looping is senseless
>
> and we actually do try to wake up kswapd in the loop, but we never
> *wait* for it, so that's largely pointless too.

Aren't we waiting for kswapd forever?
In other words, we never check whether kswapd can make some progress.
http://lkml.kernel.org/r/20150812091104.GA14940@dhcp22.suse.cz

>
> So *of*course* the direct reclaim code has to set "I made progress",
> because if it doesn't lie and say so, then the code will randomly not
> loop, and will oom, and things go to hell.
>
> But I hate the "let's tweak the zone_reclaimable" idea, because it
> doesn't actually fix anything. It just perpetuates this "the code
> doesn't make sense, so let's add *more* senseless heuristics to this
> whole loop".

I also don't think that tweaking the current reclaim logic solves the bugs
which bothered me via unexplained hangups / reboots. To me, the current
memory allocator is so puzzling that it is as if

	if (there_is_much_free_memory() == TRUE) goto OK;
	if (do_some_heuristic1() == SUCCESS) goto OK;
	if (do_some_heuristic2() == SUCCESS) goto OK;
	if (do_some_heuristic3() == SUCCESS) goto OK;
	(...snipped...)
	if (do_some_heuristicN() == SUCCESS) goto OK;
	while (1);

and we don't know how many heuristics we need to add in order to avoid
reaching the "while (1);". (We are reaching the "while (1);" before

	if (out_of_memory() == SUCCESS) goto OK;

is called.)

>
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.
>
> I'd suggest something like:
>
>  - add a "retry count"
>
>  - if direct reclaim made no progress, or made less progress than the target:
>
>        if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

Yes.

>
>  - regardless of whether we made progress or not:
>
>        if (retry count < X) goto retry;
>
>        if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
>        goto retry

I tried sleeping to reduce CPU usage and reporting via SysRq-w.
http://lkml.kernel.org/r/201411231353.BDE90173.FQOMJtHOLVFOFS@I-love.SAKURA.ne.jp

I complained at
http://lkml.kernel.org/r/201502162023.GGE26089.tJOOFQMFFHLOVS@I-love.SAKURA.ne.jp

| Oh, why every thread trying to allocate memory has to repeat
| the loop that might defer somebody who can make progress if CPU time was
| given? I wish only somebody like kswapd repeats the loop on behalf of all
| threads waiting at memory allocation slowpath...

Direct reclaim can defer termination upon SIGKILL if it is blocked on an
unkillable lock. If performance were not a problem, would direct reclaim
be mandatory? Of course, performance is the problem. Thus we would try
direct reclaim at least once. But I wish the memory allocation logic were
as simple as

  (1) If there is enough free memory, allocate it.

  (2) If there is not enough free memory, join the waitqueue list with
      wait_event_timeout(waiter, memory_reclaimed, timeout) and wait for
      reclaiming kernel threads (e.g. kswapd) to wake the waiters up.
      If the caller is willing to give up upon SIGKILL (e.g.
      __GFP_KILLABLE) then use
      wait_event_killable_timeout(waiter, memory_reclaimed, timeout)
      and return NULL upon SIGKILL.

  (3) Whenever the reclaiming kernel threads reclaimed memory and there
      are waiters, wake the waiters up.

  (4) If the reclaiming kernel threads cannot reclaim memory, the caller
      will wake up due to timeout, and invoke the OOM killer unless the
      caller does not want it (e.g. __GFP_NO_OOMKILL).

> where "X" is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).

Isn't a second too short for waiting for swapping / writeback?

> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
>
>         unsigned long timeout = jiffies + HZ/4;
>
> at the top of the function, and making the whole retry logic actually
> say something like
>
>         if (time_after(timeout, jiffies)) goto noretry;
>
> (or make *that* trigger the oom logic, or whatever).

I prefer the time-based thing, because my customer's usage (where the
watchdog timeout is configured to 10 seconds) will require kernel
messages (maybe OOM killer messages) to be printed within a few seconds.

> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?
>
>              Linus

Yes, these will be big changes. But this change will be better than
living with "no means for understanding what was happening are available"
vs. "really interesting things are observed if means are available".
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37 ` Linus Torvalds
  2015-10-14 12:21   ` Tetsuo Handa
@ 2015-10-15 13:14   ` Michal Hocko
  2015-10-16 15:57     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-15 13:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

[CC Mel and Rik as well - this has diverged from the original thread
considerably but the current topic started here:
http://lkml.kernel.org/r/201510130025.EJF21331.FFOQJtVOMLFHSO%40I-love.SAKURA.ne.jp ]

On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.

I do agree that zone_reclaimable is a subtle and hackish way to wait for
the writeback/kswapd to clean up pages which cannot be reclaimed from
the direct reclaim.

> I'd suggest something like:
>
>  - add a "retry count"
>
>  - if direct reclaim made no progress, or made less progress than the target:
>
>        if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;
>
>  - regardless of whether we made progress or not:
>
>        if (retry count < X) goto retry;
>
>        if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
>        goto retry

This will certainly cap the reclaim retries but there are risks with
this approach afaics. First of all, other allocators might piggy-back on
the current reclaimer and push it to the OOM killer even when we are not
really OOM. Maybe this is possible currently as well but it is less
likely because NR_PAGES_SCANNED is reset on a freed page, which allows
the reclaimer another round.

I am also not sure it would help with pathological cases like the one
discussed here.
If you have only a small amount of reclaimable memory on the LRU lists
then you scan it quite quickly, which will consume retries. Maybe a
sufficient timeout can help but I am afraid we can still hit the OOM
killer prematurely because a large part of the memory is still under
writeback (which might be a slow device - e.g. a USB stick).

We used to have this kind of problem in memcg reclaim. We do not have
(resp. didn't have until recently with CONFIG_CGROUP_WRITEBACK) dirty
memory throttling for memory cgroups, so the LRU can become full of
dirty data really quickly, and that led to memcg OOM killing. We are not
doing zone_reclaimable and other heuristics there, so we had to
explicitly wait_on_page_writeback in the reclaim to prevent premature
OOM killer invocations. An ugly hack, but the only thing that worked
reliably. Time-based solutions were tried and failed with different
workloads, and quite randomly depending on the load/storage.

> where "X" is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).
>
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
>
>         unsigned long timeout = jiffies + HZ/4;
>
> at the top of the function, and making the whole retry logic actually
> say something like
>
>         if (time_after(timeout, jiffies)) goto noretry;
>
> (or make *that* trigger the oom logic, or whatever).
>
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?

Yes, zone_reclaimable is subtle and imho it is used even at the wrong
level. We should decide whether we are really OOM at
__alloc_pages_slowpath.
We definitely need a big-picture logic to tell us when it makes sense to
drop the ball and trigger the OOM killer or fail the allocation request.

E.g. free + reclaimable + writeback < min_wmark on all usable zones for
more than X rounds of direct reclaim without any progress is a
sufficient signal to go OOM. Costly/noretry allocations can fail earlier
of course. This is obviously a half-baked idea which needs much more
consideration; all I am trying to say is that we need a high-level
metric to tell the OOM condition.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-15 13:14 ` Michal Hocko
@ 2015-10-16 15:57   ` Michal Hocko
  2015-10-16 18:34     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-16 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Thu 15-10-15 15:14:09, Michal Hocko wrote:
> On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
[...]
> > Now, I realize the above suggestions are big changes, and they'll
> > likely break things and we'll still need to tweak things, but dammit,
> > wouldn't that be better than just randomly tweaking the insane
> > zone_reclaimable logic?
>
> Yes zone_reclaimable is subtle and imho it is used even at the
> wrong level. We should decide whether we are really OOM at
> __alloc_pages_slowpath. We definitely need a big picture logic to tell
> us when it makes sense to drop the ball and trigger OOM killer or fail
> the allocation request.
>
> E.g. free + reclaimable + writeback < min_wmark on all usable zones for
> more than X rounds of direct reclaim without any progress is
> a sufficient signal to go OOM. Costly/noretry allocations can fail earlier
> of course. This is obviously a half baked idea which needs much more
> consideration all I am trying to say is that we need a high level metric
> to tell OOM condition.

OK so here is what I am playing with currently. It is not complete yet.
Anyway I have tested it with 2 scenarios on a swapless system with 2G of
RAM. Both do:

$ cat writer.sh
#!/bin/sh
size=$((1<<30))
block=$((4<<10))
writer()
{
	(
	while true
	do
		dd if=/dev/zero of=/mnt/data/file.$1 bs=$block count=$(($size/$block))
		rm /mnt/data/file.$1
		sync
	done
	) &
}
writer 1
writer 2
sleep 10s # allow to accumulate enough dirty pages

1) massive OOM

Start 100 memeaters, each 80M, run in parallel (anon private MAP_POPULATE
mapping). This will trigger the OOM killer many times, and the overall
count is what I was interested in. The test is considered finished when we
get to a steady state - writers can make progress and there is no more OOM
killing for some time.

$ grep "invoked oom-killer" base-run-oom.log | wc -l
78
$ grep "invoked oom-killer" test-run-oom.log | wc -l
63

So it looks like we have triggered less OOM killing with the patch
applied. I haven't checked those too closely but it seems like at least
two instances might not have triggered with the current implementation
because the DMA32 zone is considered reclaimable. But this check is
inherently racy so we cannot be sure.

$ grep "DMA32.*all_unreclaimable? no" test2-run-oom.log | wc -l
2

2) almost OOM situation

Invoke 10 memeaters in parallel and try to fill up all the memory without
triggering the OOM killer. This is quite hard and it required a lot of
tuning.
I've ended up with:

#!/bin/sh
pkill mem_eater
sync
echo 3 > /proc/sys/vm/drop_caches
sync
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
sh writer.sh &
sleep 10s
for i in $(seq 10)
do
	memcg_test/tools/mem_eater $size &
done
wait

and this one doesn't hit the OOM killer with the original implementation
while it hits it with the patch applied:

[   32.727001] DMA32 free:5428kB min:5532kB low:6912kB high:8296kB active_anon:1802520kB inactive_anon:204kB active_file:6692kB inactive_file:137184kB unevictable:0kB isolated(anon):136kB isolated(file):32kB present:2080640kB managed:1997880kB mlocked:0kB dirty:0kB writeback:137168kB mapped:6408kB shmem:204kB slab_reclaimable:20472kB slab_unreclaimable:13276kB kernel_stack:1456kB pagetables:4756kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:948764 all_unreclaimable? yes

There is a lot of memory in writeback but all_unreclaimable is yes, so who
knows, maybe it is just a coincidence we haven't triggered OOM in the
original kernel.

Anyway the two implementations will be hard to compare because the
workloads are very different, but I think something like below should be
more readable and deterministic than what we have right now. It will need
some more tuning for sure and I will be playing with it some more. I would
just like to hear opinions whether this approach makes sense. If yes I
will post it separately in a new thread for a wider discussion. This email
thread seems to be full of detours already.
---

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 15:57 ` Michal Hocko
@ 2015-10-16 18:34   ` Linus Torvalds
  2015-10-16 18:49     ` Tetsuo Handa
  2015-10-19 12:53     ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-10-16 18:34 UTC (permalink / raw)
To: Michal Hocko
Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	linux-mm, Linux Kernel Mailing List, Stanislav Kozina, Mel Gorman,
	Rik van Riel

On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> OK so here is what I am playing with currently. It is not complete
> yet.

So this looks like it's going in a reasonable direction. However:

> +	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +			ac->high_zoneidx, alloc_flags, target)) {
> +		/* Wait for some write requests to complete then retry */
> +		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +		goto retry;
> +	}

I still think we should at least spend some time re-thinking that
"wait_iff_congested()" thing. We may not actually be congested, but
might be unable to write anything out because of our allocation flags
(ie not allowed to recurse into the filesystems), so we might be in the
situation that we have a lot of dirty pages that we can't directly do
anything about.

Now, we will have woken kswapd, so something *will* hopefully be done
about them eventually, but at no time do we actually really wait for
it. We'll just busy-loop.

So at a minimum, I think we should yield to kswapd. We do do that
"cond_resched()" in wait_iff_congested(), but I'm not entirely
convinced that is at all enough to wait for kswapd to *do* something.

So before we really decide to see if we should oom, I think we should
have at least one forced io_schedule_timeout(), whether we're congested
or not.

And yes, as Tetsuo Handa said, any kind of short wait might be too
short for IO to really complete, but *something* will have completed.
Unless we're so far up the creek that we really should just oom.

But I suspect we'll have to just try things out and tweak it. This
patch looks like a reasonable starting point to me.

Tetsuo, mind trying it out and maybe tweaking it a bit for the load
you have? Does it seem to improve on your situation?

             Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34 ` Linus Torvalds
@ 2015-10-16 18:49   ` Tetsuo Handa
  2015-10-19 12:57     ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-16 18:49 UTC (permalink / raw)
To: torvalds, mhocko
Cc: rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina, mgorman, riel

Linus Torvalds wrote:
> Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> you have? Does it seem to improve on your situation?

Yes, I already tried it and just replied to Michal. I tested for one hour
using various memory stressing programs. As far as I tested, I did not hit
a silent hang up (MemAlloc-Info: X stalling task, 0 dying task, 0 victim
task. where X > 0).
----------------------------------------
[ 134.510993] Mem-Info:
[ 134.511940] active_anon:408777 inactive_anon:2088 isolated_anon:24
[ 134.511940]  active_file:15 inactive_file:24 isolated_file:0
[ 134.511940]  unevictable:0 dirty:4 writeback:1 unstable:0
[ 134.511940]  slab_reclaimable:3109 slab_unreclaimable:5594
[ 134.511940]  mapped:679 shmem:2156 pagetables:2077 bounce:0
[ 134.511940]  free:12911 free_pcp:31 free_cma:0
[ 134.521256] Node 0 DMA free:7256kB min:400kB low:500kB high:600kB active_anon:6560kB inactive_anon:180kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:184kB slab_reclaimable:236kB slab_unreclaimable:296kB kernel_stack:48kB pagetables:556kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 134.532779] lowmem_reserve[]: 0 1714 1714 1714
[ 134.534455] Node 0 DMA32 free:44388kB min:44652kB low:55812kB high:66976kB active_anon:1628548kB inactive_anon:8172kB active_file:60kB inactive_file:96kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:16kB writeback:4kB mapped:2636kB shmem:8440kB slab_reclaimable:12200kB slab_unreclaimable:22080kB kernel_stack:3584kB pagetables:7752kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1016 all_unreclaimable? yes
[ 134.545830] lowmem_reserve[]: 0 0 0 0
[ 134.547404] Node 0 DMA: 16*4kB (UME) 16*8kB (UME) 10*16kB (UME) 6*32kB (UME) 1*64kB (M) 2*128kB (UE) 1*256kB (M) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7264kB
[ 134.552766] Node 0 DMA32: 1158*4kB (UME) 638*8kB (UE) 244*16kB (UME) 163*32kB (UE) 73*64kB (UE) 34*128kB (UME) 17*256kB (UME) 10*512kB (UME) 7*1024kB (UM) 0*2048kB 0*4096kB = 44520kB
[ 134.558111] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 134.560358] 2195 total pagecache pages
[ 134.562043] 0 pages in swap cache
[ 134.563604] Swap cache stats: add 0, delete 0, find 0/0
[ 134.565441] Free swap  = 0kB
[ 134.567015] Total swap = 0kB
[ 134.568628] 524157 pages RAM
[ 134.570034] 0 pages HighMem/MovableOnly
[ 134.571681] 80368 pages reserved
[ 134.573467] 0 pages hwpoisoned
----------------------------------------

The only problem I felt is that the amount of inactive_file/writeback
(shown below) was high (compared to the dump above) when I did

$ cat < /dev/zero > /tmp/file1 & cat < /dev/zero > /tmp/file2 & cat < /dev/zero > /tmp/file3 & sleep 10; ./a.out; killall cat

but I think this patch is better than the current code.
---------------------------------------- [ 1135.909600] Mem-Info: [ 1135.910686] active_anon:321011 inactive_anon:4664 isolated_anon:0 [ 1135.910686] active_file:3170 inactive_file:78035 isolated_file:512 [ 1135.910686] unevictable:0 dirty:0 writeback:78618 unstable:0 [ 1135.910686] slab_reclaimable:5739 slab_unreclaimable:6170 [ 1135.910686] mapped:4666 shmem:8300 pagetables:1966 bounce:0 [ 1135.910686] free:12938 free_pcp:0 free_cma:0 [ 1135.925255] Node 0 DMA free:7232kB min:400kB low:500kB high:600kB active_anon:5852kB inactive_anon:196kB active_file:120kB inactive_file:980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:968kB mapped:248kB shmem:388kB slab_reclaimable:316kB slab_unreclaimable:272kB kernel_stack:64kB pagetables:100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7444 all_unreclaimable? yes [ 1135.936728] lowmem_reserve[]: 0 1714 1714 1714 [ 1135.938486] Node 0 DMA32 free:44520kB min:44652kB low:55812kB high:66976kB active_anon:1278192kB inactive_anon:18460kB active_file:12560kB inactive_file:313176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:0kB writeback:313504kB mapped:18416kB shmem:32812kB slab_reclaimable:22640kB slab_unreclaimable:24408kB kernel_stack:4240kB pagetables:7764kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2957668 all_unreclaimable? 
yes [ 1135.950355] lowmem_reserve[]: 0 0 0 0 [ 1135.952011] Node 0 DMA: 7*4kB (U) 14*8kB (UM) 13*16kB (UM) 6*32kB (UME) 1*64kB (M) 4*128kB (UME) 2*256kB (UM) 3*512kB (UME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7260kB [ 1135.957169] Node 0 DMA32: 241*4kB (UE) 929*8kB (UE) 496*16kB (UME) 277*32kB (UE) 135*64kB (UME) 17*128kB (UME) 3*256kB (E) 16*512kB (ME) 0*1024kB 0*2048kB 0*4096kB = 44972kB [ 1135.963047] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 1135.965472] 90009 total pagecache pages [ 1135.967078] 0 pages in swap cache [ 1135.968581] Swap cache stats: add 0, delete 0, find 0/0 [ 1135.970424] Free swap = 0kB [ 1135.971828] Total swap = 0kB [ 1135.973248] 524157 pages RAM [ 1135.974655] 0 pages HighMem/MovableOnly [ 1135.976230] 80368 pages reserved [ 1135.977745] 0 pages hwpoisoned ---------------------------------------- I can still hit OOM livelock ( MemAlloc-Info: X stalling task, Y dying task, Z victim task. where X > 0 && Y > 0). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:49               ` Tetsuo Handa
@ 2015-10-19 12:57                 ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-10-19 12:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina, mgorman, riel

On Sat 17-10-15 03:49:39, Tetsuo Handa wrote:
> Linus Torvalds wrote:
> > Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> > you have? Does it seem to improve on your situation?
>
> Yes, I already tried it and just replied to Michal.
>
> I tested for one hour using various memory stressing programs.
> As far as I tested, I did not hit silent hang up (

Thank you for your testing!

[...]
> Only problem I felt is that the ratio of inactive_file/writeback
> (shown below) was high (compared to shown above) when I did

Yes this is the lack of congestion on the bdi as Linus expected.
Another patch I've just posted should help in that regards. At least
it seems to help in my testing.

[...]
> I can still hit OOM livelock (
>
>   MemAlloc-Info: X stalling task, Y dying task, Z victim task.
>
> where X > 0 && Y > 0).

This seems a separate issue, though.
--
Michal Hocko
SUSE Labs
* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34             ` Linus Torvalds
  2015-10-16 18:49               ` Tetsuo Handa
@ 2015-10-19 12:53               ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-10-19 12:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Fri 16-10-15 11:34:48, Linus Torvalds wrote:
> On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > OK so here is what I am playing with currently. It is not complete
> > yet.
>
> So this looks like it's going in a reasonable direction. However:
>
> > +	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > +			ac->high_zoneidx, alloc_flags, target)) {
> > +		/* Wait for some write requests to complete then retry */
> > +		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +		goto retry;
> > +	}
>
> I still think we should at least spend some time re-thinking that
> "wait_iff_congested()" thing.

You are right. I thought we would be congested most of the time because
of the heavy IO but a quick test has shown that the zone is marked
congested but the nr_wb_congested is zero all the time. That is most
probably because the IO is throttled severly by the lack of memory as
well.

> We may not actually be congested, but
> might be unable to write anything out because of our allocation flags
> (ie not allowed to recurse into the filesystems), so we might be in
> the situation that we have a lot of dirty pages that we can't directly
> do anything about.
>
> Now, we will have woken kswapd, so something *will* hopefully be done
> about them eventually, but at no time do we actually really wait for
> it. We'll just busy-loop.
>
> So at a minimum, I think we should yield to kswapd. We do do that
> "cond_resched()" in wait_iff_congested(), but I'm not entirely
> convinced that is at all enough to wait for kswapd to *do* something.

I went with congestion_wait which is what we used to do in the past
before wait_iff_congested has been introduced. The primary reason for
the change was that congestion_wait used to cause unhealthy stalls in
the direct reclaim where the bdi wasn't really congested and so we were
sleeping for the full timeout.

Now I think we can do better even with congestion_wait. We do not have
to wait when we did_some_progress so we won't affect a regular direct
reclaim path and we can reduce sleeping to:

	dirty+writeback > reclaimable/2

This is a good signal that the reason for no progress is the stale IO
most likely and we need to wait even if the bdi itself is not
congested. We can also increase the timeout to HZ/10 because this is an
extreme slow path - we are not doing any progress and stalling is
better than OOM.

This is a diff on top of the previous patch. I even think that this
part would deserve a separate patch for a better bisect-ability. My
testing shows that close-to-oom behaves better (I can use more memory
for memeaters without hitting OOM).

What do you think?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e28028681c59..fed1bb7ea43a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,8 +3187,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, target)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback = zone_page_state(zone, NR_WRITEBACK),
+				      dirty = zone_page_state(zone, NR_FILE_DIRTY);
+			if (did_some_progress)
+				goto retry;
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent from pre mature OOM
+			 */
+			if (2*(writeback + dirty) > reclaimable)
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else
+				cond_resched();
 			goto retry;
 		}
 	}
--
Michal Hocko
SUSE Labs
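[Editor's illustration] The retry/back-off decision encoded in the diff above can be summarized as a small pure function. The following is a hedged userspace sketch, not kernel code: the enum and function names are invented for illustration, and the real patch sleeps via congestion_wait() and yields via cond_resched().

```c
#include <assert.h>

/* Illustrative model of the back-off policy in the diff above. */
enum backoff {
	BACKOFF_RETRY_NOW,	/* reclaim made progress: retry immediately */
	BACKOFF_SLEEP,		/* no progress, IO-bound: congestion_wait(HZ/10) */
	BACKOFF_YIELD		/* no progress, not IO-bound: cond_resched() */
};

static enum backoff pick_backoff(unsigned long did_some_progress,
				 unsigned long writeback,
				 unsigned long dirty,
				 unsigned long reclaimable)
{
	if (did_some_progress)
		return BACKOFF_RETRY_NOW;
	/*
	 * More than half of the reclaimable pages are dirty or under
	 * writeback: a no-progress reclaim is most likely waiting on
	 * stale IO, so sleeping beats declaring a premature OOM.
	 */
	if (2 * (writeback + dirty) > reclaimable)
		return BACKOFF_SLEEP;
	return BACKOFF_YIELD;
}
```

For example, with 100 reclaimable pages of which 40 are under writeback and 20 dirty, a no-progress reclaim would sleep; with only 20 dirty+writeback pages it would merely yield the CPU before retrying.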
* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25 ` Silent hang up caused by pages being not scanned? Tetsuo Handa
  2015-10-12 21:23   ` Linus Torvalds
@ 2015-10-13 13:32   ` Michal Hocko
  2015-10-13 16:19     ` Tetsuo Handa
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-13 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue 13-10-15 00:25:53, Tetsuo Handa wrote:
[...]
> What is strange, the values printed by this debug printk() patch did not
> change as time went by. Thus, I think that this is not a problem of lack of
> CPU time for scanning pages. I suspect that there is a bug that nobody is
> scanning pages.
>
> ----------
> [   66.821450] zone_reclaimable returned 1 at line 2646
> [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [   66.824935] shrink_zones returned 1 at line 2706
> [   66.826392] zones_reclaimable=1 at line 2765
> [   66.827865] do_try_to_free_pages returned 1 at line 2938
> [   67.102322] __perform_reclaim returned 1 at line 2854
> [   67.103968] did_some_progress=1 at line 3301
> (...snipped...)
> [  281.439977] zone_reclaimable returned 1 at line 2646
> [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [  281.439978] shrink_zones returned 1 at line 2706
> [  281.439978] zones_reclaimable=1 at line 2765
> [  281.439979] do_try_to_free_pages returned 1 at line 2938
> [  281.439979] __perform_reclaim returned 1 at line 2854
> [  281.439980] did_some_progress=1 at line 3301

This is really interesting because even with reclaimable LRUs this low
we should eventually scan them enough times to convince
zone_reclaimable to fail. PAGES_SCANNED in your logs seems to be
constant, though, which suggests somebody manages to free a page every
time before we get down to priority 0 and manage to scan something
finally. This is pretty much pathological behavior and I have hard time
to imagine how would that be possible but it clearly shows that
zone_reclaimable heuristic is not working properly.

I can see two options here. Either we teach zone_reclaimable to be less
fragile or remove zone_reclaimable from shrink_zones altogether. Both
of them are risky because we have a long history of changes in this
areas which made other subtle behavior changes but I guess that the
first option should be less fragile. What about the following patch? I
am not happy about it because the condition is rather rough and a
deeper inspection is really needed to check all the call sites but it
should be good for testing.
---
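[Editor's illustration] The heuristic under discussion is easy to model outside the kernel. Below is a hedged userspace sketch (struct and function names are invented, not kernel API) of the zone_reclaimable() test: a zone counts as reclaimable while pages_scanned stays below six times the pages on its reclaimable LRU lists, which is why a PAGES_SCANNED that never accumulates past 6 * 36 in the log above keeps the reclaim loop alive.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model (illustrative only) of the zone_reclaimable()
 * heuristic: a zone is treated as reclaimable while the pages scanned
 * since the last freed page stay below six times the pages sitting on
 * its reclaimable LRU lists.
 */
struct zone_counters {
	unsigned long pages_scanned;
	unsigned long active_file, inactive_file;
	unsigned long active_anon, inactive_anon;
};

static bool zone_reclaimable_model(const struct zone_counters *z,
				   bool swap_available)
{
	unsigned long reclaimable = z->active_file + z->inactive_file;

	/* anon pages only count as reclaimable when they can be swapped */
	if (swap_available)
		reclaimable += z->active_anon + z->inactive_anon;
	return z->pages_scanned < reclaimable * 6;
}
```

With the numbers from the log above (ACTIVE_FILE=26, INACTIVE_FILE=10, PAGES_SCANNED=32, no swap), the model keeps returning true; it would only flip to false once pages_scanned accumulated to 216 or more, which the constant PAGES_SCANNED suggests never happens.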
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 13:32   ` Michal Hocko
@ 2015-10-13 16:19     ` Tetsuo Handa
  2015-10-14 13:22       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-13 16:19 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> I can see two options here. Either we teach zone_reclaimable to be less
> fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> them are risky because we have a long history of changes in this areas
> which made other subtle behavior changes but I guess that the first
> option should be less fragile. What about the following patch? I am not
> happy about it because the condition is rather rough and a deeper
> inspection is really needed to check all the call sites but it should be
> good for testing.

While zone_reclaimable() for Node 0 DMA32 became false by your patch,
zone_reclaimable() for Node 0 DMA kept returning true, and as a result
overall result (i.e. zones_reclaimable) remained true.

$ ./a.out
---------- When there is no data to write ----------
[  162.942371] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=16
[  162.944541] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  162.946560] zone_reclaimable returned 1 at line 2665
[  162.948722] shrink_zones returned 1 at line 2716
(...snipped...)
[  164.897587] zones_reclaimable=1 at line 2775
[  164.899172] do_try_to_free_pages returned 1 at line 2948
[  167.087119] __perform_reclaim returned 1 at line 2854
[  167.088868] did_some_progress=1 at line 3301
(...snipped...)
[  261.577944] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  261.580093] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  261.582333] zone_reclaimable returned 1 at line 2665
[  261.583841] shrink_zones returned 1 at line 2716
(...snipped...)
[  264.728434] zones_reclaimable=1 at line 2775
[  264.730002] do_try_to_free_pages returned 1 at line 2948
[  268.191368] __perform_reclaim returned 1 at line 2854
[  268.193113] did_some_progress=1 at line 3301
---------- When there is no data to write ----------

Complete log (with your patch inside) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151014.txt.xz .

By the way, the OOM killer seems to be invoked prematurely for
different load if your patch is applied.

$ cat < /dev/zero > /tmp/log & sleep 10; ./a.out
---------- When there is a lot of data to write ----------
[   69.019271] Mem-Info:
[   69.019755] active_anon:335006 inactive_anon:2084 isolated_anon:23
[   69.019755]  active_file:12197 inactive_file:65310 isolated_file:31
[   69.019755]  unevictable:0 dirty:533 writeback:51020 unstable:0
[   69.019755]  slab_reclaimable:4753 slab_unreclaimable:4134
[   69.019755]  mapped:9639 shmem:2144 pagetables:2030 bounce:0
[   69.019755]  free:12972 free_pcp:45 free_cma:0
[   69.026260] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5232kB inactive_anon:96kB active_file:424kB inactive_file:1068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:164kB writeback:972kB mapped:416kB shmem:104kB slab_reclaimable:304kB slab_unreclaimable:244kB kernel_stack:96kB pagetables:256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[   69.037189] lowmem_reserve[]: 0 1729 1729 1729
[   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   69.052017] lowmem_reserve[]: 0 0 0 0
[   69.053818] Node 0 DMA: 17*4kB (UME) 8*8kB (UME) 6*16kB (UME) 2*32kB (UM) 2*64kB (UE) 4*128kB (UME) 1*256kB (U) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7332kB
[   69.059597] Node 0 DMA32: 632*4kB (UME) 454*8kB (UME) 507*16kB (UME) 310*32kB (UME) 177*64kB (UE) 61*128kB (UME) 15*256kB (ME) 19*512kB (M) 10*1024kB (M) 0*2048kB 0*4096kB = 67136kB
[   69.065810] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   69.068305] 72477 total pagecache pages
[   69.069932] 0 pages in swap cache
[   69.071435] Swap cache stats: add 0, delete 0, find 0/0
[   69.073354] Free swap  = 0kB
[   69.074822] Total swap = 0kB
[   69.076660] 524157 pages RAM
[   69.078113] 0 pages HighMem/MovableOnly
[   69.079930] 76615 pages reserved
[   69.081406] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------
* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:19     ` Tetsuo Handa
@ 2015-10-14 13:22       ` Michal Hocko
  2015-10-14 14:38         ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-14 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 01:19:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I can see two options here. Either we teach zone_reclaimable to be less
> > fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> > them are risky because we have a long history of changes in this areas
> > which made other subtle behavior changes but I guess that the first
> > option should be less fragile. What about the following patch? I am not
> > happy about it because the condition is rather rough and a deeper
> > inspection is really needed to check all the call sites but it should be
> > good for testing.
>
> While zone_reclaimable() for Node 0 DMA32 became false by your patch,
> zone_reclaimable() for Node 0 DMA kept returning true, and as a result
> overall result (i.e. zones_reclaimable) remained true.

Ahh, right you are. ZONE_DMA might have 0 or close to 0 pages on LRUs
while it is still protected from allocations which are not targeted for
this zone. My patch clearly haven't considered that. The fix for that
would be quite straightforward. We have to consider lowmem_reserve of
the zone wrt. the allocation/reclaim gfp target zone. But this is
getting more and more ugly (see the patch below just for
testing/demonstration purposes).

The OOM report is really interesting:

> [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

so your whole file LRUs are either dirty or under writeback and
reclaimable pages are below min wmark. This alone is quite suspicious.
Why hasn't balance_dirty_pages throttled writers and allowed them to
make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
configuration on that system.

Also why throttle_vm_writeout haven't slown the reclaim down?

Anyway this is exactly the case where zone_reclaimable helps us to
prevent OOM because we are looping over the remaining LRU pages without
making progress... This just shows how subtle all this is :/

I have to think about this much more..
---
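[Editor's illustration] The lowmem_reserve protection Michal refers to can be sketched as a hedged userspace model (the function name is invented; the real check lives in the kernel's __zone_watermark_ok()): a low zone such as ZONE_DMA only serves an allocation targeted at a higher zone if its free pages exceed the min watermark plus the per-target-zone reserve, so a nearly empty ZONE_DMA can stay protected while its tiny LRUs keep zone_reclaimable() returning true.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the lowmem_reserve protection: an allocation
 * whose preferred zone is classzone_idx may only dip into this zone
 * when free pages exceed the min watermark plus the reserve configured
 * for that target zone.
 */
static bool watermark_ok_model(unsigned long free_pages,
			       unsigned long min_wmark,
			       const unsigned long *lowmem_reserve,
			       int classzone_idx)
{
	return free_pages > min_wmark + lowmem_reserve[classzone_idx];
}
```

For instance, with ZONE_DMA's lowmem_reserve[] of 0 1714 1714 1714 from the reports above, about 1800 free pages would pass the check for a DMA-targeted allocation (index 0) but fail for a higher-zone-targeted one (index 2), which is exactly the asymmetry a zone_reclaimable fix has to take into account.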
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 13:22       ` Michal Hocko
@ 2015-10-14 14:38         ` Tetsuo Handa
  2015-10-14 14:59           ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 14:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> The OOM report is really interesting:
>
> > [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>
> so your whole file LRUs are either dirty or under writeback and
> reclaimable pages are below min wmark. This alone is quite suspicious.

I did

  $ cat < /dev/zero > /tmp/log

for 10 seconds before starting

  $ ./a.out

Thus, so much memory was waiting for writeback on XFS filesystem.

> Why hasn't balance_dirty_pages throttled writers and allowed them to
> make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> configuration on that system.

All values are defaults of plain CentOS 7 installation.

# sysctl -a | grep ^vm.
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256	256	32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 45056
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 54808
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

>
> Also why throttle_vm_writeout haven't slown the reclaim down?

Too difficult question for me.

>
> Anyway this is exactly the case where zone_reclaimable helps us to
> prevent OOM because we are looping over the remaining LRU pages without
> making progress... This just shows how subtle all this is :/
>
> I have to think about this much more..

I'm suspicious about tweaking current reclaim logic.
Could you please respond to Linus's comments?

There are more moles than kernel developers can find. I think that
what we can do for short term is to prepare for moles that kernel
developers could not find, and for long term is to reform page
allocator for preventing moles from living.
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:38         ` Tetsuo Handa
@ 2015-10-14 14:59           ` Michal Hocko
  2015-10-14 15:06             ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-14 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > configuration on that system.
>
> All values are defaults of plain CentOS 7 installation.

So this is 3.10 kernel, right?

> # sysctl -a | grep ^vm.
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
[...]

OK, this is nothing unusual. And I _suspect_ that the throttling simply
didn't cope with the writer speed and a large anon memory consumer.
Dirtyable memory was quite high until your anon hammer bumped in and
reduced dirtyable memory down so the file LRU is full of dirty pages
when we get under serious memory pressure. Anonymous pages are not
reclaimable so the whole memory pressure goes to file LRUs and bang.

> > Also why throttle_vm_writeout haven't slown the reclaim down?
>
> Too difficult question for me.
>
> >
> > Anyway this is exactly the case where zone_reclaimable helps us to
> > prevent OOM because we are looping over the remaining LRU pages without
> > making progress... This just shows how subtle all this is :/
> >
> > I have to think about this much more..
>
> I'm suspicious about tweaking current reclaim logic.
> Could you please respond to Linus's comments?

Yes I plan to I just didn't get to finish my email yet.

> There are more moles than kernel developers can find. I think that
> what we can do for short term is to prepare for moles that kernel
> developers could not find, and for long term is to reform page
> allocator for preventing moles from living.

This is much easier said than done :/ The current code is full of
heuristics grown over time based on very different requirements from
different kernel subsystems. There is no simple solution for this
problem I am afraid.
--
Michal Hocko
SUSE Labs
* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:59           ` Michal Hocko
@ 2015-10-14 15:06             ` Tetsuo Handa
  0 siblings, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 15:06 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > > configuration on that system.
> >
> > All values are defaults of plain CentOS 7 installation.
>
> So this is 3.10 kernel, right?

The userland is CentOS 7 but the kernel is linux-next-20151009.
end of thread, other threads:[~2015-10-19 12:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-14  8:03 Silent hang up caused by pages being not scanned? Hillf Danton
     -- strict thread matches above, loose matches on Subject: below --
2015-09-28 16:18 can't oom-kill zap the victim's memory? Tetsuo Handa
2015-10-02 12:36 ` Michal Hocko
2015-10-03  6:02   ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
2015-10-06 14:51     ` Tetsuo Handa
2015-10-12  6:43       ` Tetsuo Handa
2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
2015-10-12 21:23           ` Linus Torvalds
2015-10-13 12:21             ` Tetsuo Handa
2015-10-13 16:37               ` Linus Torvalds
2015-10-14 12:21                 ` Tetsuo Handa
2015-10-15 13:14                   ` Michal Hocko
2015-10-16 15:57                     ` Michal Hocko
2015-10-16 18:34                       ` Linus Torvalds
2015-10-16 18:49                         ` Tetsuo Handa
2015-10-19 12:57                           ` Michal Hocko
2015-10-19 12:53                         ` Michal Hocko
2015-10-13 13:32           ` Michal Hocko
2015-10-13 16:19             ` Tetsuo Handa
2015-10-14 13:22               ` Michal Hocko
2015-10-14 14:38                 ` Tetsuo Handa
2015-10-14 14:59                   ` Michal Hocko
2015-10-14 15:06                     ` Tetsuo Handa
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox