linux-mm.kvack.org archive mirror
* [PATCH] mm/page_alloc: Favor kthread and dying threads over normal threads
@ 2015-09-22 16:34 Tetsuo Handa
  0 siblings, 0 replies; 3+ messages in thread
From: Tetsuo Handa @ 2015-09-22 16:34 UTC (permalink / raw)
  To: linux-mm; +Cc: xfs, Tetsuo Handa

shrink_inactive_list() and throttle_direct_reclaim() expect that dying
threads are not throttled, so that they can leave the memory allocator,
exit, and release their memory shortly.
Also, throttle_direct_reclaim() expects that kernel threads are not
throttled, as they may be indirectly responsible for cleaning pages
necessary for reclaim to make forward progress.
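
For reference, the checks in question look roughly like the following.
This is a paraphrased sketch of the mm/vmscan.c logic around this kernel
version, not an exact quote:

  /* shrink_inactive_list(): do not keep a dying thread waiting. */
  while (unlikely(too_many_isolated(zone, file, sc))) {
          congestion_wait(BLK_RW_ASYNC, HZ/10);

          /* We are about to die and free our memory. Return now. */
          if (fatal_signal_pending(current))
                  return SWAP_CLUSTER_MAX;
  }

  /* throttle_direct_reclaim(): kernel threads and dying threads are
     never throttled; only ordinary tasks may sleep on pfmemalloc_wait. */
  if (current->flags & PF_KTHREAD)
          goto out;
  if (fatal_signal_pending(current))
          goto out;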

Currently, __GFP_WAIT && order <= PAGE_ALLOC_COSTLY_ORDER && !__GFP_NORETRY
&& !__GFP_NOFAIL allocation requests implicitly retry forever unless
TIF_MEMDIE is set by the OOM killer. Also, the OOM killer currently sets
TIF_MEMDIE on only one thread, even if there are 1000 threads sharing the
mm struct. All threads get SIGKILL and are treated as dying threads, but
only the OOM victim thread with TIF_MEMDIE is favored at several locations
in the memory allocator. While OOM victim threads without TIF_MEMDIE can
acquire TIF_MEMDIE by calling out_of_memory(), they cannot do so unless
they are doing __GFP_FS allocations.
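
The asymmetry comes from the tail of oom_kill_process(); in paraphrased
form (not an exact quote) it does roughly the following, so only the
selected victim task is marked with TIF_MEMDIE while the other threads
merely receive SIGKILL:

  /* Every task sharing the victim's mm gets SIGKILL... */
  for_each_process(p)
          if (p->mm == mm && !same_thread_group(p, victim))
                  do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);

  /* ...but only the chosen victim is marked as an OOM victim. */
  mark_oom_victim(victim);                /* sets TIF_MEMDIE */
  do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);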

Therefore, __GFP_WAIT && order <= PAGE_ALLOC_COSTLY_ORDER && !__GFP_NORETRY
&& !__GFP_NOFAIL && !__GFP_FS allocation requests issued by dying threads
and kernel threads are throttled by the above-mentioned implicit retry loop,
because they use the watermark meant for normal threads' normal allocation
requests.
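
In simplified, paraphrased form, the implicit retry loop in
__alloc_pages_slowpath() looks roughly like this (again not an exact
quote of this kernel version):

  retry:
          page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
          if (page)
                  goto got_pg;

          page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
                                              ac, &did_some_progress);
          if (page)
                  goto got_pg;

          /*
           * Any progress (including the "pretend" progress that
           * __alloc_pages_may_oom() reports for !__GFP_FS requests)
           * sends a !__GFP_NORETRY, order <= PAGE_ALLOC_COSTLY_ORDER
           * request back to retry against the normal watermarks.
           */
          if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) {
                  wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
                  goto retry;
          }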

The effect of this throttling becomes visible on XFS (as in the kernel
messages shown below) if we revert commit cc87317726f8 ("mm: page_alloc:
revert inadvertent !__GFP_FS retry behavior change").

  [   66.089978] Kill process 8505 (a.out) sharing same memory
  [   69.748060] XFS: a.out(8082) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   69.798580] XFS: kworker/u16:28(381) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   69.876952] XFS: xfs-data/sda1(399) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [   70.359518] XFS: a.out(8412) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   73.299509] XFS: kworker/u16:28(381) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   73.470350] XFS: xfs-data/sda1(399) possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  [   73.664420] XFS: a.out(8082) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   73.967434] XFS: a.out(8412) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   76.950038] XFS: kworker/u16:28(381) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  [   76.957938] XFS: xfs-data/sda1(399) possible memory allocation deadlock in kmem_alloc (mode:0x8250)

Favoring only TIF_MEMDIE threads is prone to causing OOM livelock. Also,
favoring only dying threads still allows OOM livelock, because dying
threads sometimes depend on memory allocations issued by kernel threads
(as the kernel messages above show).

Kernel threads and dying threads (especially OOM victim threads) want
higher priority than normal threads. This patch favors them (as
throttle_direct_reclaim() does) by implicitly applying ALLOC_HIGH priority.
This patch should help handle OOM events where a multi-threaded program
(e.g. java) is chosen as an OOM victim while the victim is blocked on
unkillable locks (e.g. an inode's mutex).
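
For reference, ALLOC_HIGH relaxes the free-page check in
__zone_watermark_ok() roughly as follows (paraphrased from
mm/page_alloc.c of this era):

  long min = mark;

  if (alloc_flags & ALLOC_HIGH)
          min -= min / 2;         /* may dip up to 50% below the watermark */
  if (alloc_flags & ALLOC_HARDER)
          min -= min / 4;         /* a further 25%, e.g. for rt tasks */

  if (free_pages <= min + z->lowmem_reserve[classzone_idx])
          return false;

So the patch does not bypass watermarks entirely; it only lets kernel
threads and dying threads dip deeper into the reserves than normal
threads, similar to what explicit __GFP_HIGH callers already get.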

Presumably we do not need to apply ALLOC_NO_WATERMARKS priority to
TIF_MEMDIE threads if we evenly favor all OOM victim threads. But that is
outside of this patch's scope, because, after all, we need to handle cases
where killing other threads is necessary for OOM victim threads to make
forward progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bcfd70..f0c9098 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3010,6 +3010,13 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 				((current->flags & PF_MEMALLOC) ||
 				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
+		/*
+		 * Favor kernel threads and dying threads like
+		 * shrink_inactive_list() and throttle_direct_reclaim().
+		 */
+		else if (!atomic && ((current->flags & PF_KTHREAD) ||
+				     fatal_signal_pending(current)))
+			alloc_flags |= ALLOC_HIGH;
 	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] mm/page_alloc: Favor kthread and dying threads over normal threads
  2015-09-10 14:18 Tetsuo Handa
@ 2015-09-11 15:19 ` Tetsuo Handa
  0 siblings, 0 replies; 3+ messages in thread
From: Tetsuo Handa @ 2015-09-11 15:19 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, hannes, linux-mm

Tetsuo Handa wrote:
> From fb48bec5d08068bc68023f4684098d0ce9ab6439 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 10 Sep 2015 20:13:38 +0900
> Subject: [PATCH] mm/page_alloc: Favor kthread and dying threads over normal
>  threads

The effect of this patch (which gives higher priority to kernel threads
and dying threads) becomes clear when the different reproducer shown below
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
	static char buffer[4096] = { };
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	sleep(2);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

static int memory_consumer(void *unused)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(3);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	/* Pass the top of a 4 KiB malloc()ed region as each child's stack. */
	for (i = 0; i < 2; i++)
		clone(file_writer, malloc(4 * 1024) + 4 * 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	clone(memory_consumer, malloc(4 * 1024) + 4 * 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	pause();
	return 0;
}
----------------------------------------

is used together with the "GFP_NOFS can fail" patch shown below.

----------------------------------------
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a7a3a63..d21742c4 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -54,8 +54,9 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
 			return ptr;
 		if (!(++retries % 100))
-			xfs_err(NULL,
+			xfs_err(NULL, "%s(%u) "
 		"possible memory allocation deadlock in %s (mode:0x%x)",
+					current->comm, current->pid,
 					__func__, lflags);
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 	} while (1);
@@ -119,8 +120,9 @@ kmem_zone_alloc(kmem_zone_t *zone, xfs_km_flags_t flags)
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
 			return ptr;
 		if (!(++retries % 100))
-			xfs_err(NULL,
+			xfs_err(NULL, "%s(%u) "
 		"possible memory allocation deadlock in %s (mode:0x%x)",
+					current->comm, current->pid,
 					__func__, lflags);
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 	} while (1);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8ecffb3..3ea4188 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -353,8 +353,9 @@ retry:
 			 * handle buffer allocation failures we can't do much.
 			 */
 			if (!(++retries % 100))
-				xfs_err(NULL,
+				xfs_err(NULL, "%s(%u) "
 		"possible memory allocation deadlock in %s (mode:0x%x)",
+					current->comm, current->pid,
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcfe935..2c8873b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2680,6 +2680,9 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
+	if (!(gfp_mask & __GFP_FS))
+		return;
+
 	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
 	    debug_guardpage_minorder() > 0)
 		return;
@@ -2764,12 +2767,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 		/* The OOM killer does not compensate for IO-less reclaim */
 		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
 			goto out;
 		}
 		if (pm_suspended_storage())
----------------------------------------

Without this patch, we can observe that the workqueue handling writeback
got stuck in a memory allocation (indicated by XFS's "possible memory
allocation deadlock" warnings), which is exactly the dependency that
throttle_direct_reclaim() warns about: kernel threads may be indirectly
responsible for cleaning pages needed for reclaim to make forward progress.
----------------------------------------
[  174.062364] systemd-journal invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[  174.064543] systemd-journal cpuset=/ mems_allowed=0
[  174.066339] CPU: 2 PID: 470 Comm: systemd-journal Not tainted 4.2.0-next-20150909+ #110
[  174.068416] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  174.070951]  0000000000000000 000000009a0f4f1a ffff880035de3af8 ffffffff8131bd76
[  174.073060]  ffff880035ce9980 ffff880035de3ba0 ffffffff81187d2d ffff880035de3b20
[  174.075067]  ffffffff8108fc93 ffff8800775d4ed0 ffff8800775d4c80 ffff8800775d4c80
[  174.077365] Call Trace:
[  174.078381]  [<ffffffff8131bd76>] dump_stack+0x4e/0x88
[  174.079888]  [<ffffffff81187d2d>] dump_header+0x82/0x232
[  174.081422]  [<ffffffff8108fc93>] ? preempt_count_add+0x43/0x90
[  174.084411]  [<ffffffff8108fc0d>] ? get_parent_ip+0xd/0x50
[  174.086105]  [<ffffffff8108fc93>] ? preempt_count_add+0x43/0x90
[  174.087817]  [<ffffffff8111b8bb>] oom_kill_process+0x35b/0x3c0
[  174.089493]  [<ffffffff810737d0>] ? has_ns_capability_noaudit+0x30/0x40
[  174.091212]  [<ffffffff810737f2>] ? has_capability_noaudit+0x12/0x20
[  174.092926]  [<ffffffff8111bb8d>] out_of_memory+0x21d/0x4a0
[  174.094552]  [<ffffffff81121184>] __alloc_pages_nodemask+0x904/0x930
[  174.096426]  [<ffffffff811643b0>] alloc_pages_vma+0xb0/0x1f0
[  174.098244]  [<ffffffff81144ed2>] handle_mm_fault+0x13f2/0x19d0
[  174.100161]  [<ffffffff81163397>] ? change_prot_numa+0x17/0x30
[  174.101943]  [<ffffffff81057912>] __do_page_fault+0x152/0x480
[  174.103483]  [<ffffffff81057c70>] do_page_fault+0x30/0x80
[  174.104982]  [<ffffffff816382e8>] page_fault+0x28/0x30
[  174.106378] Mem-Info:
[  174.107285] active_anon:314047 inactive_anon:1920 isolated_anon:16
[  174.107285]  active_file:11066 inactive_file:87440 isolated_file:0
[  174.107285]  unevictable:0 dirty:5533 writeback:81919 unstable:0
[  174.107285]  slab_reclaimable:4102 slab_unreclaimable:4889
[  174.107285]  mapped:10081 shmem:2148 pagetables:1906 bounce:0
[  174.107285]  free:13078 free_pcp:30 free_cma:0
[  174.116538] Node 0 DMA free:7312kB min:400kB low:500kB high:600kB active_anon:5204kB inactive_anon:144kB active_file:216kB inactive_file:976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:32kB writeback:932kB mapped:296kB shmem:180kB slab_reclaimable:288kB slab_unreclaimable:300kB kernel_stack:240kB pagetables:396kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7792 all_unreclaimable? yes
[  174.126674] lowmem_reserve[]: 0 1729 1729 1729
[  174.129129] Node 0 DMA32 free:45000kB min:44652kB low:55812kB high:66976kB active_anon:1250984kB inactive_anon:7536kB active_file:44048kB inactive_file:348784kB unevictable:0kB isolated(anon):64kB isolated(file):0kB present:2080640kB managed:1774196kB mlocked:0kB dirty:22100kB writeback:326744kB mapped:40028kB shmem:8412kB slab_reclaimable:16120kB slab_unreclaimable:19256kB kernel_stack:3920kB pagetables:7228kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  174.141837] lowmem_reserve[]: 0 0 0 0
[  174.143413] Node 0 DMA: 5*4kB (EM) 2*8kB (UE) 7*16kB (UEM) 3*32kB (UE) 3*64kB (UEM) 2*128kB (E) 2*256kB (UE) 2*512kB (UM) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7348kB
[  174.148343] Node 0 DMA32: 691*4kB (UE) 650*8kB (UEM) 242*16kB (UE) 30*32kB (UE) 6*64kB (UE) 3*128kB (U) 7*256kB (UEM) 6*512kB (UE) 24*1024kB (UEM) 1*2048kB (E) 0*4096kB = 45052kB
[  174.154113] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  174.156382] 100695 total pagecache pages
[  174.157986] 0 pages in swap cache
[  174.159431] Swap cache stats: add 0, delete 0, find 0/0
[  174.161316] Free swap  = 0kB
[  174.162748] Total swap = 0kB
[  174.164874] 524157 pages RAM
[  174.166635] 0 pages HighMem/MovableOnly
[  174.168878] 76632 pages reserved
[  174.170472] 0 pages hwpoisoned
[  174.171788] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  174.174162] [  470]     0   470    34593     2894      31       3        0             0 systemd-journal
[  174.176546] [  485]     0   485    10290      810      23       3        0         -1000 systemd-udevd
[  174.179056] [  507]     0   507    12795      763      25       3        0         -1000 auditd
[  174.181386] [ 1688]     0  1688    82430     6883      83       3        0             0 firewalld
[  174.184146] [ 1691]    70  1691     6988      671      18       3        0             0 avahi-daemon
[  174.186614] [ 1694]     0  1694    54104     1701      40       3        0             0 rsyslogd
[  174.189930] [ 1695]     0  1695   137547     5615      88       3        0             0 tuned
[  174.192275] [ 1698]     0  1698     4823      678      15       3        0             0 irqbalance
[  174.194670] [ 1699]     0  1699     1095      358       8       3        0             0 rngd
[  174.196894] [ 1705]     0  1705    53609     2135      59       3        0             0 abrtd
[  174.199280] [ 1706]     0  1706    53001     1962      57       4        0             0 abrt-watch-log
[  174.202202] [ 1708]     0  1708     8673      726      23       3        0             0 systemd-logind
[  174.205167] [ 1709]    81  1709     6647      734      18       3        0          -900 dbus-daemon
[  174.207828] [ 1717]     0  1717    31578      802      20       3        0             0 crond
[  174.210248] [ 1756]    70  1756     6988       57      17       3        0             0 avahi-daemon
[  174.212817] [ 1900]     0  1900    46741     1920      43       3        0             0 vmtoolsd
[  174.215156] [ 2445]     0  2445    25938     3354      49       3        0             0 dhclient
[  174.217955] [ 2449]   999  2449   128626     3447      49       4        0             0 polkitd
[  174.220319] [ 2532]     0  2532    20626     1512      42       4        0         -1000 sshd
[  174.222694] [ 2661]     0  2661     7320      596      19       3        0             0 xinetd
[  174.224974] [ 4080]     0  4080    22770     1182      43       3        0             0 master
[  174.227266] [ 4247]    89  4247    22796     1533      46       3        0             0 pickup
[  174.229483] [ 4248]    89  4248    22813     1605      45       3        0             0 qmgr
[  174.231719] [ 4772]     0  4772    75242     1276      96       3        0             0 nmbd
[  174.234313] [ 4930]     0  4930    92960     3416     130       3        0             0 smbd
[  174.236671] [ 4967]     0  4967    92960     1516     125       3        0             0 smbd
[  174.239945] [ 5046]     0  5046    27503      571      12       3        0             0 agetty
[  174.242850] [11027]     0 11027    21787     1047      48       3        0             0 login
[  174.246033] [11030]  1000 11030    28865      904      14       3        0             0 bash
[  174.248385] [11108]  1000 11107   541750   295927     588       6        0             0 a.out
[  174.250806] Out of memory: Kill process 11109 (a.out) score 662 or sacrifice child
[  174.252879] Killed process 11108 (a.out) total-vm:2167000kB, anon-rss:1182716kB, file-rss:992kB
[  178.675269] XFS: crond(1717) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  178.729646] XFS: kworker/u16:29(382) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  180.805219] XFS: crond(1717) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  180.877987] XFS: kworker/u16:29(382) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  182.392209] XFS: vmtoolsd(1900) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  182.961922] XFS: crond(1717) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  183.050782] XFS: kworker/u16:29(382) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
(...snipped...)
[  255.566378] kworker/u16:29  D ffff88007fc55b40     0   382      2 0x00000000
[  255.568322] Workqueue: writeback wb_workfn (flush-8:0)
[  255.569891]  ffff880077666fc8 0000000000000046 ffff880077646600 ffff880077668000
[  255.572002]  ffff880077667030 ffff88007fc4dfc0 00000000ffff501b 0000000000000040
[  255.574117]  ffff880077667010 ffffffff81631bf8 ffff88007fc4dfc0 ffff880077667090
[  255.576197] Call Trace:
[  255.577225]  [<ffffffff81631bf8>] schedule+0x38/0x90
[  255.578825]  [<ffffffff81635672>] schedule_timeout+0x122/0x1c0
[  255.580629]  [<ffffffff810c8020>] ? cascade+0x90/0x90
[  255.582192]  [<ffffffff81635769>] schedule_timeout_uninterruptible+0x19/0x20
[  255.584132]  [<ffffffff81120eb8>] __alloc_pages_nodemask+0x638/0x930
[  255.585816]  [<ffffffff8116310c>] alloc_pages_current+0x8c/0x100
[  255.587608]  [<ffffffff8127bf7a>] xfs_buf_allocate_memory+0x17b/0x26e
[  255.589529]  [<ffffffff81246bca>] xfs_buf_get_map+0xca/0x130
[  255.591139]  [<ffffffff81247144>] xfs_buf_read_map+0x24/0xb0
[  255.592828]  [<ffffffff8126ec77>] xfs_trans_read_buf_map+0x97/0x1a0
[  255.594633]  [<ffffffff812223d3>] xfs_btree_read_buf_block.constprop.28+0x73/0xc0
[  255.596745]  [<ffffffff8122249b>] xfs_btree_lookup_get_block+0x7b/0xf0
[  255.598527]  [<ffffffff812223e9>] ? xfs_btree_read_buf_block.constprop.28+0x89/0xc0
[  255.600567]  [<ffffffff8122638e>] xfs_btree_lookup+0xbe/0x4a0
[  255.602289]  [<ffffffff8120d546>] xfs_alloc_lookup_eq+0x16/0x20
[  255.604092]  [<ffffffff8120da7d>] xfs_alloc_fixup_trees+0x23d/0x340
[  255.605915]  [<ffffffff812110cc>] ? xfs_allocbt_init_cursor+0x3c/0xc0
[  255.607577]  [<ffffffff8120f381>] xfs_alloc_ag_vextent_near+0x511/0x880
[  255.609336]  [<ffffffff8120fdb5>] xfs_alloc_ag_vextent+0xb5/0xe0
[  255.611082]  [<ffffffff81210866>] xfs_alloc_vextent+0x356/0x460
[  255.613046]  [<ffffffff8121e496>] xfs_bmap_btalloc+0x386/0x6d0
[  255.614684]  [<ffffffff8121e7e9>] xfs_bmap_alloc+0x9/0x10
[  255.616322]  [<ffffffff8121f1e9>] xfs_bmapi_write+0x4b9/0xa10
[  255.617969]  [<ffffffff8125280c>] xfs_iomap_write_allocate+0x13c/0x320
[  255.619818]  [<ffffffff812407ba>] xfs_map_blocks+0x15a/0x170
[  255.621500]  [<ffffffff8124177b>] xfs_vm_writepage+0x18b/0x5b0
[  255.623066]  [<ffffffff811228ce>] __writepage+0xe/0x30
[  255.624593]  [<ffffffff811232f3>] write_cache_pages+0x1f3/0x4a0
[  255.626287]  [<ffffffff811228c0>] ? mapping_tagged+0x10/0x10
[  255.628265]  [<ffffffff811235ec>] generic_writepages+0x4c/0x80
[  255.630194]  [<ffffffff8108fc0d>] ? get_parent_ip+0xd/0x50
[  255.631883]  [<ffffffff8108fc93>] ? preempt_count_add+0x43/0x90
[  255.633464]  [<ffffffff8124062e>] xfs_vm_writepages+0x3e/0x50
[  255.635000]  [<ffffffff81124199>] do_writepages+0x19/0x30
[  255.636549]  [<ffffffff811b3de3>] __writeback_single_inode+0x33/0x170
[  255.638345]  [<ffffffff81635fe5>] ? _raw_spin_unlock+0x15/0x40
[  255.640005]  [<ffffffff811b44a9>] writeback_sb_inodes+0x279/0x440
[  255.641636]  [<ffffffff811b46f1>] __writeback_inodes_wb+0x81/0xb0
[  255.643310]  [<ffffffff811b48cc>] wb_writeback+0x1ac/0x1e0
[  255.644866]  [<ffffffff811b4e45>] wb_workfn+0xe5/0x2f0
[  255.646384]  [<ffffffff8163606c>] ? _raw_spin_unlock_irq+0x1c/0x40
[  255.648301]  [<ffffffff8108bda9>] ? finish_task_switch+0x69/0x230
[  255.649915]  [<ffffffff81081a59>] process_one_work+0x129/0x300
[  255.651479]  [<ffffffff81081d45>] worker_thread+0x115/0x450
[  255.653019]  [<ffffffff81081c30>] ? process_one_work+0x300/0x300
[  255.654664]  [<ffffffff81087113>] kthread+0xd3/0xf0
[  255.656060]  [<ffffffff81087040>] ? kthread_create_on_node+0x1a0/0x1a0
[  255.657736]  [<ffffffff81636b1f>] ret_from_fork+0x3f/0x70
[  255.659240]  [<ffffffff81087040>] ? kthread_create_on_node+0x1a0/0x1a0
(...snipped...)
[  262.539668] Showing busy workqueues and worker pools:
[  262.540997] workqueue events: flags=0x0
[  262.542153]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=5/256
[  262.544104]     pending: vmpressure_work_fn, e1000_watchdog [e1000], vmstat_update, vmw_fb_dirty_flush [vmwgfx], console_callback
[  262.547286] workqueue events_freezable: flags=0x4
[  262.548604]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  262.550398]     pending: vmballoon_work [vmw_balloon]
[  262.552006] workqueue events_power_efficient: flags=0x80
[  262.553542]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  262.555281]     pending: neigh_periodic_work
[  262.556624] workqueue events_freezable_power_: flags=0x84
[  262.558168]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  262.560241]     in-flight: 214:disk_events_workfn
[  262.561837] workqueue writeback: flags=0x4e
[  262.563258]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[  262.564916]     in-flight: 382:wb_workfn wb_workfn
[  262.566530] workqueue xfs-data/sda1: flags=0xc
[  262.567905]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=6/256
[  262.569664]     in-flight: 11065:xfs_end_io, 11066:xfs_end_io, 11026:xfs_end_io, 11068:xfs_end_io, 11064:xfs_end_io, 82:xfs_end_io
[  262.572704]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=17/256
[  262.574497]     in-flight: 447:xfs_end_io, 398(RESCUER):xfs_end_io xfs_end_io xfs_end_io xfs_end_io xfs_end_io xfs_end_io xfs_end_io xfs_end_io, 11071:xfs_end_io, 11072:xfs_end_io, 11069:xfs_end_io, 11090:xfs_end_io, 11073:xfs_end_io, 11091:xfs_end_io, 23:xfs_end_io, 11070:xfs_end_io
[  262.581400] pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=4 idle: 11096 47 4
[  262.583536] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=10 manager: 86
[  262.585596] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=14 idle: 11063 11062 11061 11060 11059 30 84 11067
[  262.588545] pool 16: cpus=0-7 flags=0x4 nice=0 workers=32 idle: 380 381 379 378 377 376 375 374 373 372 371 370 369 368 367 366 365 364 363 362 361 360 359 358 277 279 6 271 69 384 383
[  263.463828] XFS: vmtoolsd(1900) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  264.167134] XFS: crond(1717) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  264.292440] XFS: pickup(4247) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
[  264.335779] XFS: kworker/u16:29(382) possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
----------------------------------------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20150911.txt.xz

With this patch applied, I did not see the warning as far as I tested.

I don't know whether ALLOC_HIGH is the best choice, but I think that
favoring kernel threads can help them make forward progress.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH] mm/page_alloc: Favor kthread and dying threads over normal threads
@ 2015-09-10 14:18 Tetsuo Handa
  2015-09-11 15:19 ` Tetsuo Handa
  0 siblings, 1 reply; 3+ messages in thread
From: Tetsuo Handa @ 2015-09-10 14:18 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, hannes, linux-mm

From fb48bec5d08068bc68023f4684098d0ce9ab6439 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 10 Sep 2015 20:13:38 +0900
Subject: [PATCH] mm/page_alloc: Favor kthread and dying threads over normal
 threads

shrink_inactive_list() and throttle_direct_reclaim() expect that dying
threads are not throttled, so that they can leave the memory allocator,
die, and release their memory shortly.
Also, throttle_direct_reclaim() expects that kernel threads are not
throttled, as they may be indirectly responsible for cleaning pages
necessary for reclaim to make forward progress.

Currently, __GFP_WAIT && order <= PAGE_ALLOC_COSTLY_ORDER && !__GFP_NORETRY
&& !__GFP_NOFAIL allocation requests implicitly retry forever unless
TIF_MEMDIE is set by the OOM killer. But we are unlikely to be able to
change such requests not to retry in the near future, because most
allocation failure paths are not well tested. If we changed that now and
added __GFP_NOFAIL to callers, we would increase the possibility of waiting
on unkillable OOM victim threads.

Also, the OOM killer currently sets TIF_MEMDIE on only one thread even if
there are 1000 threads sharing the mm struct. All threads get SIGKILL and
are treated as dying threads, but there is a problem. While OOM victim
threads with TIF_MEMDIE are favored at several locations in the memory
allocator, OOM victim threads without TIF_MEMDIE are not favored (except
in the above-mentioned shrink_inactive_list()) unless they are doing
__GFP_FS allocations.

Therefore, __GFP_WAIT && order <= PAGE_ALLOC_COSTLY_ORDER && !__GFP_NORETRY
allocation requests issued by dying threads and kernel threads are throttled
by the above-mentioned implicit retry loop, because they use the watermark
meant for normal threads' normal allocation requests.

For example, kernel threads and OOM victim threads without TIF_MEMDIE can
fall into an OOM livelock condition using the reproducer shown below
(whose threads mutually block one another on an unkillable mutex).

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	sleep(2);
	while (write(fd, "", 1) == 1);
	return 0;
}

static int memory_consumer(void *unused)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(3);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	for (i = 0; i < 1000; i++)
		clone(file_writer, malloc(4 * 1024) + 4 * 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	clone(memory_consumer, malloc(4 * 1024) + 4 * 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	pause();
	return 0;
}
----------

The OOM killer chooses only one thread (and lets __alloc_pages_slowpath()
favor only that thread) even when the threads are mutually blocked. This
is prone to causing OOM livelock.

But favoring only dying threads still causes OOM livelock because
sometimes dying threads depend on memory allocations issued by kernel
threads.

Kernel threads and dying threads (especially OOM victim threads) want
higher priority than normal threads. This patch favors them by implicitly
applying the ALLOC_HIGH watermark.

Presumably we do not need to apply ALLOC_NO_WATERMARKS priority to
TIF_MEMDIE threads if we evenly favor all OOM victim threads. But that is
outside of this patch's scope, because, after all, we need to handle cases
where killing other threads is necessary for OOM victim threads to make
forward progress (e.g. multiple instances of the reproducer shown above
running concurrently).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcfe935..777c331 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2990,6 +2990,13 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 				((current->flags & PF_MEMALLOC) ||
 				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
+		/*
+		 * Favor kernel threads and dying threads like
+		 * shrink_inactive_list() and throttle_direct_reclaim().
+		 */
+		else if (!atomic && ((current->flags & PF_KTHREAD) ||
+				     fatal_signal_pending(current)))
+			alloc_flags |= ALLOC_HIGH;
 	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-09-22 16:34 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-22 16:34 [PATCH] mm/page_alloc: Favor kthread and dying threads over normal threads Tetsuo Handa
  -- strict thread matches above, loose matches on Subject: below --
2015-09-10 14:18 Tetsuo Handa
2015-09-11 15:19 ` Tetsuo Handa
