* [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
@ 2019-02-13 16:30 Tetsuo Handa
From: Tetsuo Handa @ 2019-02-13 16:30 UTC (permalink / raw)
  To: linux-mm

This resumes the discussion at https://lkml.kernel.org/r/1500202791-5427-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .



Reproducer:
----------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	static int pipe_fd[2] = { EOF, EOF };
	char *buf = NULL;
	unsigned long size = 0;
	unsigned int i;
	int fd;
	char buffer[4096];
	pipe(pipe_fd);
	signal(SIGCLD, SIG_IGN);
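	/* An idle child named "first-victim" for the OOM killer to pick first. */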
	if (fork() == 0) {
		prctl(PR_SET_NAME, (unsigned long) "first-victim", 0, 0, 0);
		while (1)
			pause();
	}
	close(pipe_fd[1]);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			char c;
			/* Wait until the first-victim is OOM-killed. */
			read(pipe_fd[0], &c, 1);
			/* Try to consume CPU time via page fault. */
			memset(buffer, 0, sizeof(buffer));
			_exit(0);
		}
	close(pipe_fd[0]);
	fd = open("/dev/zero", O_RDONLY);
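	/* Grab as much virtual memory as realloc() will give us (up to 512GB). */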
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	while (size) {
		int ret = read(fd, buf, size); /* Will cause OOM due to overcommit */
		if (ret <= 0)
			break;
		buf += ret;
		size -= ret;
	}
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
----------



Before this patch: http://I-love.SAKURA.ne.jp/tmp/serial-20190212.txt.xz
Numbers obtained by grep'ing the SysRq-t output captured during the stall:

  $ grep -F 'Call Trace:' serial-20190212.txt | wc -l
  1234
  $ grep -F 'locks held by' serial-20190212.txt | wc -l
  1046
  $ grep -F '__alloc_pages_nodemask' serial-20190212.txt | wc -l
  1046
  $ grep -F '__alloc_pages_slowpath+0x16f8/0x2350' serial-20190212.txt | wc -l
  946

About 90% of the allocating threads (946 of the 1046 inside
__alloc_pages_nodemask) are sleeping at

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

and almost all of them are simply waiting for CPU time (indicated by a
'locks held by' line without any lock information, because the task is in
TASK_RUNNING state). That is, many hundreds of allocating threads are
keeping the owner of oom_lock preempted. The traces below show the thread
that invoked the OOM killer (PID 17881) and, roughly 200 seconds later,
the same thread still inside dump_header(), having been preempted in the
middle of dump_stack().

[  504.760909] normal-priority invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  513.650210] CPU: 0 PID: 17881 Comm: normal-priority Kdump: loaded Not tainted 5.0.0-rc6 #826
[  513.653799] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  513.657968] Call Trace:
[  513.660026]  dump_stack+0x86/0xca
[  513.662292]  dump_header+0x10a/0x9d0
[  513.664673]  ? _raw_spin_unlock_irqrestore+0x3d/0x60
[  513.667319]  ? ___ratelimit+0x1d1/0x3c5
[  513.669682]  oom_kill_process.cold.32+0xb/0x5b9
[  513.672218]  ? check_flags.part.40+0x420/0x420
[  513.675347]  ? rcu_read_unlock_special+0x87/0x100
[  513.678734]  out_of_memory+0x287/0x7f0
[  513.681146]  ? oom_killer_disable+0x1f0/0x1f0
[  513.683629]  ? mutex_trylock+0x191/0x1e0
[  513.685983]  ? __alloc_pages_slowpath+0xa03/0x2350
[  513.688692]  __alloc_pages_slowpath+0x1cdf/0x2350
[  513.692541]  ? release_pages+0x8d6/0x12d0
[  513.696140]  ? warn_alloc+0x120/0x120
[  513.699669]  ? __lock_is_held+0xbc/0x140
[  513.703204]  ? __might_sleep+0x95/0x190
[  513.706554]  __alloc_pages_nodemask+0x510/0x5f0

[  717.991658] normal-priority R  running task    23432 17881   9439 0x80000080
[  717.994203] Call Trace:
[  717.995530]  __schedule+0x69a/0x1890
[  717.997116]  ? pci_mmcfg_check_reserved+0x120/0x120
[  717.999020]  ? __this_cpu_preempt_check+0x13/0x20
[  718.001299]  ? lockdep_hardirqs_on+0x347/0x5a0
[  718.003175]  ? preempt_schedule_irq+0x35/0x80
[  718.004966]  ? trace_hardirqs_on+0x28/0x170
[  718.006704]  preempt_schedule_irq+0x40/0x80
[  718.008440]  retint_kernel+0x1b/0x2d
[  718.010167] RIP: 0010:dump_stack+0xbc/0xca
[  718.011880] Code: c7 c0 ed 66 96 e8 7e d5 e2 fe c7 05 34 bc ed 00 ff ff ff ff 0f ba e3 09 72 09 53 9d e8 87 03 c4 fe eb 07 e8 10 02 c4 fe 53 9d <5b> 41 5c 41 5d 5d c3 90 90 90 90 90 90 90 55 48 89 e5 41 57 49 89
[  718.018262] RSP: 0000:ffff888111a672e0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[  718.020947] RAX: 0000000000000007 RBX: 0000000000000286 RCX: 1ffff1101563db64
[  718.023530] RDX: 0000000000000000 RSI: ffffffff95c6ff40 RDI: ffff8880ab1edab4
[  718.026597] RBP: ffff888111a672f8 R08: ffff8880ab1edab8 R09: 0000000000000006
[  718.029478] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  718.032224] R13: 00000000ffffffff R14: ffff888111a67628 R15: ffff888111a67628
[  718.035339]  dump_header+0x10a/0x9d0
[  718.037231]  ? _raw_spin_unlock_irqrestore+0x3d/0x60
[  718.039349]  ? ___ratelimit+0x1d1/0x3c5
[  718.041380]  oom_kill_process.cold.32+0xb/0x5b9
[  718.043451]  ? check_flags.part.40+0x420/0x420
[  718.045418]  ? rcu_read_unlock_special+0x87/0x100
[  718.047453]  out_of_memory+0x287/0x7f0
[  718.049245]  ? oom_killer_disable+0x1f0/0x1f0
[  718.051527]  ? mutex_trylock+0x191/0x1e0
[  718.053398]  ? __alloc_pages_slowpath+0xa03/0x2350
[  718.055478]  __alloc_pages_slowpath+0x1cdf/0x2350
[  718.057978]  ? release_pages+0x8d6/0x12d0
[  718.060245]  ? warn_alloc+0x120/0x120
[  718.062836]  ? __lock_is_held+0xbc/0x140
[  718.065815]  ? __might_sleep+0x95/0x190
[  718.068060]  __alloc_pages_nodemask+0x510/0x5f0

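The same scheduling pattern can be sketched in pure userspace. The program
below is an illustration only (not kernel code; the thread count and loop
sizes are arbitrary values picked for this sketch): one thread plays the
oom_lock owner and does a fixed amount of work under a mutex, while many
contender threads either keep doing trylock-and-burn-CPU retries, mimicking
the current behaviour, or simply block on the mutex, mimicking what the
patch below switches to. With far more contenders than CPUs, the owner's
critical section can take far longer in the trylock mode.
----------
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define CONTENDERS 512			/* arbitrary; should exceed the CPU count */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int owner_done;
static int block_on_lock;		/* 0: trylock + burn CPU, 1: block on the lock */

static void burn_cpu(void)
{
	/* Stand-in for the CPU time an allocating thread spends in the
	   allocator slowpath between oom_lock trylock attempts. */
	for (volatile int i = 0; i < 200000; i++)
		;
}

static void *contender(void *arg)
{
	(void) arg;
	while (!atomic_load(&owner_done)) {
		if (block_on_lock) {
			/* Wait for the owner instead of competing for CPU. */
			pthread_mutex_lock(&lock);
			pthread_mutex_unlock(&lock);
		} else if (pthread_mutex_trylock(&lock) == 0) {
			pthread_mutex_unlock(&lock);
		} else {
			burn_cpu(); /* deprives the owner of CPU time */
		}
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t th[CONTENDERS];
	struct timespec a, b;
	int i;

	block_on_lock = argc > 1; /* any argument switches to blocking mode */
	for (i = 0; i < CONTENDERS; i++)
		pthread_create(&th[i], NULL, contender, NULL);
	/* The "owner": do a fixed amount of work while holding the lock. */
	pthread_mutex_lock(&lock);
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < 100; i++)
		burn_cpu();
	clock_gettime(CLOCK_MONOTONIC, &b);
	pthread_mutex_unlock(&lock);
	atomic_store(&owner_done, 1);
	for (i = 0; i < CONTENDERS; i++)
		pthread_join(th[i], NULL);
	printf("owner's critical section took %.3f seconds\n",
	       (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
	return 0;
}
----------
(Build with something like "gcc -O2 -pthread"; run with no argument for the
trylock mode and with any argument for the blocking mode.)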


After this patch: http://I-love.SAKURA.ne.jp/tmp/serial-20190212-2.txt.xz
The OOM killer is invoked smoothly, though the system eventually got stuck
due to a different problem.



While this patch cannot avoid delays caused by unlimited concurrent direct
reclaim, let's stop telling the lie

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */

because many of the allocating threads are in fact preventing the owner of
oom_lock from making progress. Therefore, here is the patch again.



From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 12 Feb 2019 20:12:35 +0900
Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.

When many hundreds of threads concurrently trigger page faults and one of
them invokes the global OOM killer, the owner of oom_lock can stay
preempted for minutes, because the other allocating threads deprive the
owner of oom_lock of CPU time instead of waiting for it to make progress.
We don't want to disable preemption while holding oom_lock, but we do want
the owner of oom_lock to complete as soon as possible.

Thus, this patch kills the dangerous assumption that sleeping for one
jiffy is sufficient for allowing the owner of oom_lock to make progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde0..c867513 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3618,7 +3618,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
+		if (mutex_lock_killable(&oom_lock) == 0)
+			mutex_unlock(&oom_lock);
+		else if (!tsk_is_oom_victim(current))
+			schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
-- 
1.8.3.1
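
For readability, this is roughly how the oom_lock retry path in
__alloc_pages_may_oom() reads once the hunk above is applied (reconstructed
from the diff and its quoted context; the comments inside the new branch are
my reading of the intent, not part of the patch):

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                /*
                 * Wait until the current owner of oom_lock releases it,
                 * rather than retrying trylock after a one-jiffy sleep,
                 * so that the owner is not deprived of CPU time.
                 */
                if (mutex_lock_killable(&oom_lock) == 0)
                        mutex_unlock(&oom_lock);
                else if (!tsk_is_oom_victim(current))
                        /*
                         * Interrupted by a fatal signal but not an OOM
                         * victim: fall back to the old one-jiffy sleep.
                         */
                        schedule_timeout_uninterruptible(1);
                return NULL;
        }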

