[PATCH 00 of 11] oom deadlock fixes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00 of 11] oom deadlock fixes
@ 2008-01-03  2:09 Andrea Arcangeli
  2008-01-03  2:09 ` [PATCH 01 of 11] limit shrink zone scanning Andrea Arcangeli
                   ` (10 more replies)
  0 siblings, 11 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

Hello,

I rebased the patchset that fixes certain oom deadlocks that are still
reproducible with mainline.

I removed the global VM_is_OOM to keep the fixes to the minimum required to not
be kernel-crashing, I adapted to the introduction of the zone-oom-lock bitflag,
so with this update I hopefully have a better chance for merging.

Despite the lack of VM_is_OOM perfect oom-killing serialization, and the need
to avoid waiting forever on TIF_MEMDIE to prevent deadlocks, in practice with
swap it seem not to generate bad spurious kills and it avoids the deadlock as
well as my older patchset. We can always perfect this later with feedback
coming from do_exit.

Thanks.
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01 of 11] limit shrink zone scanning
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:11   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 02 of 11] avoid oom deadlock in nfs_create_request Andrea Arcangeli
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324664 -3600
# Node ID 0d6cfa8fbffe9d3593fff5d50e47d0217a8ae1b7
# Parent  e28e1be3fae5183e3e36e32e3feb9a59ec59c825
limit shrink zone scanning

Assume two tasks adds to nr_scan_*active at the same time (first line of the
old buggy code), they'll effectively double their scan rate, for no good
reason. What can happen is that instead of scanning nr_entries each, they'll
scan nr_entries*2 each. The more CPUs the bigger the race and the higher the
multiplication effect and the harder it will be to detect oom. This puts a cap
on the amount of work that it makes sense to do in case the race triggers.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1114,7 +1114,7 @@ static unsigned long shrink_zone(int pri
 	 */
 	zone->nr_scan_active +=
 		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
+	nr_active = min(zone->nr_scan_active, zone_page_state(zone, NR_ACTIVE));
 	if (nr_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;
 	else
@@ -1122,7 +1122,7 @@ static unsigned long shrink_zone(int pri
 
 	zone->nr_scan_inactive +=
 		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
+	nr_inactive = min(zone->nr_scan_inactive, zone_page_state(zone, NR_INACTIVE));
 	if (nr_inactive >= sc->swap_cluster_max)
 		zone->nr_scan_inactive = 0;
 	else

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 02 of 11] avoid oom deadlock in nfs_create_request
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
  2008-01-03  2:09 ` [PATCH 01 of 11] limit shrink zone scanning Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:13   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 03 of 11] prevent oom deadlocks during read/write operations Andrea Arcangeli
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324664 -3600
# Node ID 3ec0754b24738f5658baf2a1ebacc7d3f2c17283
# Parent  0d6cfa8fbffe9d3593fff5d50e47d0217a8ae1b7
avoid oom deadlock in nfs_create_request

When sigkill is pending after the oom killer set TIF_MEMDIE, the task
must go away or the VM will malfunction.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte
 	struct nfs_server *server = NFS_SERVER(inode);
 	struct nfs_page		*req;
 
-	for (;;) {
-		/* try to allocate the request struct */
-		req = nfs_page_alloc();
-		if (req != NULL)
-			break;
+	/* try to allocate the request struct */
+	req = nfs_page_alloc();
+	if (unlikely(!req)) {
+		/*
+		 * -ENOMEM will be returned only when TIF_MEMDIE is set
+		 * so userland shouldn't risk to get confused by a new
+		 * unhandled ENOMEM errno.
+		 */
+		WARN_ON(!test_thread_flag(TIF_MEMDIE));
+		return ERR_PTR(-ENOMEM);
+	}
 
-		if (signalled() && (server->flags & NFS_MOUNT_INTR))
-			return ERR_PTR(-ERESTARTSYS);
-		yield();
-	}
+	if (signalled() && (server->flags & NFS_MOUNT_INTR))
+		return ERR_PTR(-ERESTARTSYS);
 
 	/* Initialize the request struct. Initially, we assume a
 	 * long write-back delay. This will be adjusted in

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 03 of 11] prevent oom deadlocks during read/write operations
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
  2008-01-03  2:09 ` [PATCH 01 of 11] limit shrink zone scanning Andrea Arcangeli
  2008-01-03  2:09 ` [PATCH 02 of 11] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:15   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 04 of 11] avoid selecting already killed tasks Andrea Arcangeli
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324664 -3600
# Node ID 71f1d848763c80f336f7f9f23dcbe8f7e43c82aa
# Parent  3ec0754b24738f5658baf2a1ebacc7d3f2c17283
prevent oom deadlocks during read/write operations

We need to react to SIGKILL during read/write with huge buffers or it
becomes too easy to prevent a SIGKILLED task to run do_exit promptly
after it has been selected for oom-killage.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -927,6 +927,16 @@ page_ok:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
+			page_cache_release(page);
+			goto out;
+		}
+
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in
+			 * presence of sigkill and lots of ram/swap
+			 * (think during OOM).
+			 */
 			page_cache_release(page);
 			goto out;
 		}
@@ -2063,6 +2073,16 @@ static ssize_t generic_perform_write_2co
 			break;
 		}
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in
+			 * presence of sigkill and lots of ram/swap
+			 * (think during OOM).
+			 */
+			status = -ENOMEM;
+			break;
+		}
+
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
@@ -2230,6 +2250,16 @@ again:
 		 */
 		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
 			status = -EFAULT;
+			break;
+		}
+
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			status = -ENOMEM;
+			/*
+			 * Must not hang almost forever in D state in
+			 * presence of sigkill and lots of ram/swap
+			 * (think during OOM).
+			 */
 			break;
 		}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 03 of 11] prevent oom deadlocks during read/write operations Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-03  9:40   ` David Rientjes
  2008-01-07 19:17   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 05 of 11] reduce the probability of an OOM livelock Andrea Arcangeli
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324664 -3600
# Node ID 4cf8805b5695a8a3fb7c1d11fa4d41d0b2650cb0
# Parent  71f1d848763c80f336f7f9f23dcbe8f7e43c82aa
avoid selecting already killed tasks

If the killed task doesn't go away because it's waiting on some other
task who needs to allocate memory, to release the i_sem or some other
lock, we must fallback to killing some other task in order to kill the
original selected and already oomkilled task, but the logic that kills
the childs first, would deadlock, if the already oom-killed task was
actually the first child of the newly oom-killed task.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -363,6 +363,12 @@ static int oom_kill_process(struct task_
 	list_for_each_entry(c, &p->children, sibling) {
 		if (c->mm == p->mm)
 			continue;
+		/*
+		 * We cannot select tasks with TIF_MEMDIE already set
+		 * or we'll hard deadlock.
+		 */
+		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
+			continue;
 		if (!oom_kill_task(c))
 			return 0;
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 05 of 11] reduce the probability of an OOM livelock
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 04 of 11] avoid selecting already killed tasks Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:32   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 06 of 11] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324664 -3600
# Node ID fc9148f0ddd0ef11be29dba89e3fc96df8f0b9bf
# Parent  4cf8805b5695a8a3fb7c1d11fa4d41d0b2650cb0
reduce the probability of an OOM livelock

There's no need to loop way too many times over the lrus in order to
declare defeat and decide to kill a task. The more loops we do the more
likely there we'll run in a livelock with a page bouncing back and
forth between tasks. The maximum number of entries to check in a loop
that returns less than swap-cluster-max pages freed, should be the size
of the list (or at most twice the size of the list if you want to be
really paranoid about the PG_referenced bit).

Our objective there is to know reliably when it's time that we kill a
task, tring to free a few more pages at that already ciritical point is
worthless.

This seems to have the effect of reducing the "hang" time during oom
killing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1211,7 +1211,6 @@ unsigned long try_to_free_pages(struct z
 	int priority;
 	int ret = 0;
 	unsigned long total_scanned = 0;
-	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
 	int i;
@@ -1237,15 +1236,17 @@ unsigned long try_to_free_pages(struct z
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		unsigned long nr_reclaimed;
+
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		nr_reclaimed = shrink_zones(priority, zones, &sc);
+		if (reclaim_state)
+			reclaim_state->reclaimed_slab = 0;
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
-		if (reclaim_state) {
+		if (reclaim_state)
 			nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
 		total_scanned += sc.nr_scanned;
 		if (nr_reclaimed >= sc.swap_cluster_max) {
 			ret = 1;
@@ -1320,7 +1321,6 @@ static unsigned long balance_pgdat(pg_da
 	int priority;
 	int i;
 	unsigned long total_scanned;
-	unsigned long nr_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -1337,7 +1337,6 @@ static unsigned long balance_pgdat(pg_da
 
 loop_again:
 	total_scanned = 0;
-	nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
@@ -1347,6 +1346,7 @@ loop_again:
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		unsigned long nr_reclaimed;
 
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
@@ -1393,6 +1393,7 @@ loop_again:
 		 * pages behind kswapd's direction of progress, which would
 		 * cause too much scanning of the lower zones.
 		 */
+		nr_reclaimed = 0;
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 06 of 11] balance_pgdat doesn't return the number of pages freed
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 05 of 11] reduce the probability of an OOM livelock Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:33   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324665 -3600
# Node ID 4ef302dd29164e19111c49bb0db2ad4840eace18
# Parent  fc9148f0ddd0ef11be29dba89e3fc96df8f0b9bf
balance_pgdat doesn't return the number of pages freed

nr_reclaimed would be the number of pages freed in the last pass.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1298,8 +1298,6 @@ out:
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *
- * Returns the number of pages which were actually freed.
- *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
  * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
@@ -1315,7 +1313,7 @@ out:
  * the page allocator fallback scheme to ensure that aging of pages is balanced
  * across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static void balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -1475,8 +1473,6 @@ out:
 
 		goto loop_again;
 	}
-
-	return nr_reclaimed;
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 06 of 11] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-03  9:52   ` David Rientjes
  2008-01-03  2:09 ` [PATCH 08 of 11] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199324665 -3600
# Node ID 686a1129469a1bad96745705ffe1567146bae222
# Parent  4ef302dd29164e19111c49bb0db2ad4840eace18
don't depend on PF_EXITING tasks to go away

A PF_EXITING task don't have TIF_MEMDIE set so it might get stuck in
memory allocations without access to the PF_MEMALLOC pool (said that
ideally do_exit would better not require memory allocations, especially
not before calling exit_mm). The same way we raise its privilege to
TIF_MEMDIE if it's the current task, we should do it even if it's not
the current task to speedup oom killing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -223,27 +223,13 @@ static struct task_struct *select_bad_pr
 		 * Note: this may have a chance of deadlock if it gets
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
+		 *
+		 * Better not to skip PF_EXITING tasks, since they
+		 * don't have access to the PF_MEMALLOC pool until
+		 * we select them here first.
 		 */
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
-
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
-		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = ULONG_MAX;
-		}
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 08 of 11] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-03  2:09 ` [PATCH 09 of 11] oom select should only take rss into account Andrea Arcangeli
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199325610 -3600
# Node ID 59c2caaf27ab2eba9aaaab7f9a06089bf164f22f
# Parent  686a1129469a1bad96745705ffe1567146bae222
stop useless vm trashing while we wait the TIF_MEMDIE task to exit

There's no point in trying to free memory if we're oom.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,6 +1129,13 @@ static unsigned long shrink_zone(int pri
 		nr_inactive = 0;
 
 	while (nr_active || nr_inactive) {
+		if (unlikely(zone_is_oom_locked(zone))) {
+			if (!test_thread_flag(TIF_MEMDIE))
+				/* get out of the way */
+				schedule_timeout_interruptible(1);
+			else
+				break;
+		}
 		if (nr_active) {
 			nr_to_scan = min(nr_active,
 					(unsigned long)sc->swap_cluster_max);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 09 of 11] oom select should only take rss into account
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 08 of 11] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:35   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 10 of 11] limit reclaim if enough pages have been freed Andrea Arcangeli
  2008-01-03  2:09 ` [PATCH 11 of 11] not-wait-memdie Andrea Arcangeli
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199325618 -3600
# Node ID 03ad5aceb1e3e64d53a3537bc86dba8c268b1954
# Parent  59c2caaf27ab2eba9aaaab7f9a06089bf164f22f
oom select should only take rss into account

Running workloads where many tasks grow their virtual memory
simultaneously, so they all have a relatively small virtual memory when
oom triggers (if compared to innocent longstanding tasks), the oom
killer then selects mysql/apache and other things with very large VM but
very small RSS. RSS is the only thing that matters, killing a task with
huge VM but zero RSS is not useful. Many apps tend to have large VM but
small RSS in the first place (regardless of swapping activity) and they
shouldn't be penalized like this.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -66,7 +66,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_rss(mm);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -90,7 +90,7 @@ unsigned long badness(struct task_struct
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
 		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+			points += get_mm_rss(child->mm)/2 + 1;
 		task_unlock(child);
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 10 of 11] limit reclaim if enough pages have been freed
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 09 of 11] oom select should only take rss into account Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-07 19:37   ` Christoph Lameter
  2008-01-03  2:09 ` [PATCH 11 of 11] not-wait-memdie Andrea Arcangeli
  10 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199325619 -3600
# Node ID 30fd9dd17ca34a24f0666be2e5e52d3369b0090b
# Parent  03ad5aceb1e3e64d53a3537bc86dba8c268b1954
limit reclaim if enough pages have been freed

No need to wipe out an huge chunk of the cache.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1149,6 +1149,8 @@ static unsigned long shrink_zone(int pri
 			nr_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
+			if (nr_reclaimed >= sc->swap_cluster_max)
+				break;
 		}
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 11 of 11] not-wait-memdie
  2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2008-01-03  2:09 ` [PATCH 10 of 11] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2008-01-03  2:09 ` Andrea Arcangeli
  2008-01-03  9:55   ` David Rientjes
  2008-01-07 19:43   ` Christoph Lameter
  10 siblings, 2 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  2:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199325619 -3600
# Node ID 504e981185254a12282df2add7c3c1131eb810dc
# Parent  30fd9dd17ca34a24f0666be2e5e52d3369b0090b
not-wait-memdie

Don't wait tif-memdie tasks forever because they may be stuck in some kernel
lock owned by some task that requires memory to exit the critical section.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -215,21 +215,14 @@ static struct task_struct *select_bad_pr
 		if (is_global_init(p))
 			continue;
 
-		/*
-		 * This task already has access to memory reserves and is
-		 * being killed. Don't allow any other task access to the
-		 * memory reserve.
-		 *
-		 * Note: this may have a chance of deadlock if it gets
-		 * blocked waiting for another task which itself is waiting
-		 * for memory. Is there a better alternative?
-		 *
-		 * Better not to skip PF_EXITING tasks, since they
-		 * don't have access to the PF_MEMALLOC pool until
-		 * we select them here first.
-		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
-			return ERR_PTR(-1UL);
+		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
+			/*
+			 * Hopefully we already waited long enough,
+			 * or exit_mm already run, but we must try to kill
+			 * another task to avoid deadlocking.
+			 */
+			continue;
+		}
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;
@@ -478,12 +471,8 @@ retry:
 		 * issues we may have.
 		 */
 		p = select_bad_process(&points);
-
-		if (PTR_ERR(p) == -1UL)
-			goto out;
-
 		/* Found nothing?!?! Either we hang forever, or we panic. */
-		if (!p) {
+		if (unlikely(!p)) {
 			read_unlock(&tasklist_lock);
 			panic("Out of memory and no killable processes...\n");
 		}
@@ -495,7 +484,6 @@ retry:
 		break;
 	}
 
-out:
 	read_unlock(&tasklist_lock);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03  2:09 ` [PATCH 04 of 11] avoid selecting already killed tasks Andrea Arcangeli
@ 2008-01-03  9:40   ` David Rientjes
  2008-01-03 13:41     ` Andrea Arcangeli
  2008-01-07 19:17   ` Christoph Lameter
  1 sibling, 1 reply; 40+ messages in thread
From: David Rientjes @ 2008-01-03  9:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> avoid selecting already killed tasks
> 
> If the killed task doesn't go away because it's waiting on some other
> task who needs to allocate memory, to release the i_sem or some other
> lock, we must fallback to killing some other task in order to kill the
> original selected and already oomkilled task, but the logic that kills
> the childs first, would deadlock, if the already oom-killed task was
> actually the first child of the newly oom-killed task.
> 

The problem is that this can cause the parent or one of its children to be 
unnecessarily killed.

Regardless of any OOM killer sychronization that we do, it is still 
possible for the OOM killer to return after killing a task and then 
another OOM situation be triggered on a subsequent allocation attempt 
before the killed task has exited.  It's still marked as TIF_MEMDIE, so 
your change will exempt it from being a target again and one of its 
siblings or, worse, it's parent will be killed.

You can't guarantee that this couldn't have been prevented given 
sufficient time for the exiting task to die, so this change introduces the 
possibility that tasks will unnecessarily be killed to alleviate the OOM 
condition.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away
  2008-01-03  2:09 ` [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
@ 2008-01-03  9:52   ` David Rientjes
  2008-01-03 13:29     ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: David Rientjes @ 2008-01-03  9:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> A PF_EXITING task don't have TIF_MEMDIE set so it might get stuck in
> memory allocations without access to the PF_MEMALLOC pool (said that
> ideally do_exit would better not require memory allocations, especially
> not before calling exit_mm). The same way we raise its privilege to
> TIF_MEMDIE if it's the current task, we should do it even if it's not
> the current task to speedup oom killing.
> 

That's partially incorrect; it's possible that the PF_EXITING task does 
have TIF_MEMDIE since the OOM killer synchronization does not guarantee 
that a killed task has fully exited before a subsequent OOM killer 
invocation occurs.

So what's going to happen here is that either the PF_EXITING task is 
current (the OOM-triggering task) and is immediately going to be chosen 
for OOM kill or the OOM killer is going to become a no-op and wait for the 
OOM condition to be alleviated in some other manner.

We're unconcerned about the former because it will be chosen for OOM kill 
and will receive the TIF_MEMDIE exemption.

The latter is more interesting and seems to be what you're targeting in 
your changelog: exiting tasks that have encountered an OOM condition that 
blocks them from continuing.  But if that's the case, we still need to 
free some memory from somewhere so the only natural thing to do is OOM 
kill another task.  That's precisely what the current code is doing:  
causing the OOM killer to become a no-op for the current case (the 
already exiting task) and freeing memory by waiting for another system 
OOM, which is guaranteed to happen.  That's sound logic since it doesn't 
do any good to OOM kill an already exiting task.

So my suggestion would be to allow the non-OOM-triggering candidate task 
to be considered as a target instead of simply returning ERR_PTR(-1UL):

	if (p->flags & PF_EXITING && p == current &&
	    !test_tsk_thread_flag(p, TIF_MEMDIE)) {
		chosen = p;
		**points = ULONG_MAX;
	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-03  2:09 ` [PATCH 11 of 11] not-wait-memdie Andrea Arcangeli
@ 2008-01-03  9:55   ` David Rientjes
  2008-01-03 13:06     ` Andrea Arcangeli
  2008-01-07 19:43   ` Christoph Lameter
  1 sibling, 1 reply; 40+ messages in thread
From: David Rientjes @ 2008-01-03  9:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> Don't wait tif-memdie tasks forever because they may be stuck in some kernel
> lock owned by some task that requires memory to exit the critical section.
> 

This increases the possibility that tasks will be needlessly killed in OOM 
conditions.  While the indefinite timeout is definitely not the ideal 
solution, it does have the advantage of preventing unnecessary kills.  The 
OOM synchronization is good, but it's not _that_ good: it doesn't ensure 
that the OOM killed task has fully exited before another invocation of the 
OOM killer happens.

So I think we're moving in a direction of requiring OOM killer timeouts 
and the only plausible way to do that is on a per-task level.  It would 
require another unsigned long addition to struct task_struct but would 
completely fix OOM killer deadlocks.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-03  9:55   ` David Rientjes
@ 2008-01-03 13:06     ` Andrea Arcangeli
  2008-01-03 18:54       ` David Rientjes
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03 13:06 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Andrew Morton

On Thu, Jan 03, 2008 at 01:55:35AM -0800, David Rientjes wrote:
> On Thu, 3 Jan 2008, Andrea Arcangeli wrote:
> 
> > Don't wait tif-memdie tasks forever because they may be stuck in some kernel
> > lock owned by some task that requires memory to exit the critical section.
> > 
> 
> This increases the possibility that tasks will be needlessly killed in OOM 
> conditions.  While the indefinite timeout is definitely not the ideal 

Yes, and doing so it avoids the deadlock.

> solution, it does have the advantage of preventing unnecessary kills.  The 
> OOM synchronization is good, but it's not _that_ good: it doesn't ensure 
> that the OOM killed task has fully exited before another invocation of the 
> OOM killer happens.
> 
> So I think we're moving in a direction of requiring OOM killer timeouts 
> and the only plausible way to do that is on a per-task level.  It would 
> require another unsigned long addition to struct task_struct but would 
> completely fix OOM killer deadlocks.

Yes. That can be added incrementally in a later patch, it's a new
logic. In the meantime the deadlock is gone.

In practice there's a sort of timeout already, the oom-killing task
will schedule_timeout(1) with the zone-oom-lock hold, and all other
tasks trying to free memory (except the TIF_MEMDIE one) will also
schedule_timeout(1) inside the VM code. That tends to prevent spurious
kills already but it's far from a guarantee.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away
  2008-01-03  9:52   ` David Rientjes
@ 2008-01-03 13:29     ` Andrea Arcangeli
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03 13:29 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Andrew Morton

On Thu, Jan 03, 2008 at 01:52:26AM -0800, David Rientjes wrote:
> That's partially incorrect; it's possible that the PF_EXITING task does 
> have TIF_MEMDIE since the OOM killer synchronization does not guarantee 
> that a killed task has fully exited before a subsequent OOM killer 
> invocation occurs.

With an oom killer invocation normally PF_EXITING will be set after
TIF_MEMDIE.

Here we deal with the case of one task exiting because the C code
invokes exit(2) but TIF_MEMDIE isn't set. And then the system goes oom.

> So what's going to happen here is that either the PF_EXITING task is 
> current (the OOM-triggering task) and is immediately going to be chosen 
> for OOM kill or the OOM killer is going to become a no-op and wait for the 
> OOM condition to be alleviated in some other manner.

IIRC the system reached OOM after the PF_EXITING task invoked exit_mm
but before the task was reaped from the tasklist because the parent
couldn't yet run waitpid() (the parent is stuck in the global oom
condition). Like you said the oom killer become a noop and the system
crashed.

> We're unconcerned about the former because it will be chosen for OOM kill 
> and will receive the TIF_MEMDIE exemption.

Yes.

> The latter is more interesting and seems to be what you're targeting in 
> your changelog: exiting tasks that have encountered an OOM condition that 

Yes.

> blocks them from continuing.  But if that's the case, we still need to 
> free some memory from somewhere so the only natural thing to do is OOM 
> kill another task.

Yes we need to kill another task, the PF_EXITING task already run
exit_mm _before_ the oom condition triggered.

>  That's precisely what the current code is doing:  

Really the oom killer is currently doing this:

       if (p->flags & PF_EXITING) {
  	    if (p != current)
	         return ERR_PTR(-1UL);
	    chosen = p;
	    *ppoints = ULONG_MAX;
       }

This means if there's a PF_EXITING task that isn't the current task
(it can't be the current task because it's not runnable anymore by the
scheduler by time the system is oom, do_exit already scheduled), the
oom killer will forever try to kill that PF_EXITING task that can't
run and can't release any further memory because it quit by itself
_before_ the oom condition triggered. I seem to recall I reproduced
this with my current testcase...

> causing the OOM killer to become a no-op for the current case (the 
> already exiting task) and freeing memory by waiting for another system 
> OOM, which is guaranteed to happen.  That's sound logic since it doesn't 
> do any good to OOM kill an already exiting task.

depends if the PF_EXITING task has already run exit_mm or not. If it
didn't it does good. If it did it doesn't good.

> So my suggestion would be to allow the non-OOM-triggering candidate task 
> to be considered as a target instead of simply returning ERR_PTR(-1UL):
> 
> 	if (p->flags & PF_EXITING && p == current &&
> 	    !test_tsk_thread_flag(p, TIF_MEMDIE)) {
> 		chosen = p;
> 		**points = ULONG_MAX;
> 	}

Yes this is a workable option. In practice adding the above or not,
won't make difference 99% of the time. Removing the "return
ERR_PTR(-1UL)" is about not deadlocking 1% of the time. From my point
of view as long as mainline stops deadlocking and I can close bugzilla
I'm fine. I'm totally flexible about adding any wish like above ;).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03  9:40   ` David Rientjes
@ 2008-01-03 13:41     ` Andrea Arcangeli
  2008-01-03 18:47       ` David Rientjes
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03 13:41 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Andrew Morton

On Thu, Jan 03, 2008 at 01:40:09AM -0800, David Rientjes wrote:
> On Thu, 3 Jan 2008, Andrea Arcangeli wrote:
> 
> > avoid selecting already killed tasks
> > 
> > If the killed task doesn't go away because it's waiting on some other
> > task who needs to allocate memory, to release the i_sem or some other
> > lock, we must fallback to killing some other task in order to kill the
> > original selected and already oomkilled task, but the logic that kills
> > the childs first, would deadlock, if the already oom-killed task was
> > actually the first child of the newly oom-killed task.
> > 
> 
> The problem is that this can cause the parent or one of its children to be 
> unnecessarily killed.

Well, the single fact I'm skipping over the TIF_MEMDIE tasks to
prevent deadlocks, allows for spurious oom killing again. Like you
said we can later add a per-task timeout so we wait only X seconds for
a certain TIF_MEMDIE task to quit before selecting another one.

But we got to ignore those TIF_MEMDIE tasks unfortunately, or we
deadlock, no matter if we're in select_bad_process, or in
oom_kill_process. Initially I didn't notice oom_kill_process had that
problem so I was then deadlocking despite select_bad_process was
selecting the parent that didn't have TIF_MEMDIE set (but the first
child already had it).

> Regardless of any OOM killer sychronization that we do, it is still 
> possible for the OOM killer to return after killing a task and then 
> another OOM situation be triggered on a subsequent allocation attempt 
> before the killed task has exited.  It's still marked as TIF_MEMDIE, so 
> your change will exempt it from being a target again and one of its 
> siblings or, worse, it's parent will be killed.

This is the risk of suprious oom killing yes. You got to choose
between a deadlock and risking a suprious oom killing. Even when you
add your 60second timeout in the task_struct between each new TIF_MEMDIE
bitflag set, you're still going to risk spurious oom killing...

The schedule_timeout in the oom killer and in the VM that I have in my
patchset combined with your very limited functionality of
zone-oom-lock (limited because it's gone by the time out_of_memory
returns and it currently can't take into account when the TIF_MEMDIE
task actually exited) in practice didn't generate suprious kills in my
testing. It may not be enough but it's a start...

> You can't guarantee that this couldn't have been prevented given 
> sufficient time for the exiting task to die, so this change introduces the 
> possibility that tasks will unnecessarily be killed to alleviate the OOM 
> condition.

Not just to 'alleviate' the oom condition, but to prevent a system crash.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03 13:41     ` Andrea Arcangeli
@ 2008-01-03 18:47       ` David Rientjes
  2008-01-03 19:54         ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: David Rientjes @ 2008-01-03 18:47 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> But we got to ignore those TIF_MEMDIE tasks unfortunately, or we
> deadlock, no matter if we're in select_bad_process, or in
> oom_kill_process. Initially I didn't notice oom_kill_process had that
> problem so I was then deadlocking despite select_bad_process was
> selecting the parent that didn't have TIF_MEMDIE set (but the first
> child already had it).
> 

Traditionally we've only allowed one thread in the entire system to have 
TIF_MEMDIE at a time because as you give access to memory reserves to more 
threads it becomes less of a help to exiting tasks.  So by OOM killing a 
sibling or parent you could be taking away more memory from the exiting 
task; hopefully it won't be noticeable and the sibling or parent will 
quickly exit.

Perhaps instead of killing additional tasks, we should only make the 
exemption if a TIF_MEMDIE task is TASK_UNINTERRUPTIBLE during the 
select_bad_process() scan.  I haven't personally witnessed any blocking in 
the exit path of an OOM killed task that doesn't leave it in D state and 
prevents it from dying.

> This is the risk of suprious oom killing yes. You got to choose
> between a deadlock and risking a suprious oom killing. Even when you
> add your 60second timeout in the task_struct between each new TIF_MEMDIE
> bitflag set, you're still going to risk spurious oom killing...
> 

The 60-second (or whatever time limit) timeout would almost certainly 
always target tasks stuck in D state.  Those tasks will probably never 
exit if the timeout expires no matter how many times it is OOM killed.  So 
the best alternative is to then take TIF_MEMDIE away from that task, 
reduce its timeslice, and never select it again for OOM kill.

If you agree with me that an addition to struct task_struct to keep the 
jiffies of the time it received TIF_MEMDIE is beneficial then it will 
obsolete this patch.

> The schedule_timeout in the oom killer and in the VM that I have in my
> patchset combined with your very limited functionality of
> zone-oom-lock (limited because it's gone by the time out_of_memory
> returns and it currently can't take into account when the TIF_MEMDIE
> task actually exited) in practice didn't generate suprious kills in my
> testing. It may not be enough but it's a start...
> 

The zone-oom-lock wasn't intended to necessarily prevent subsequent OOM 
kills of tasks, it was intended to serialize the OOM killer so that 
multiple entrants will not be killing additional tasks when one would have 
sufficed.

It was made on a per-zone level instead of a global level, as your 
approach did, to support cpusets and memory policy OOM killings.  With a 
global approach these OOM kills would have taken longer because you were 
serializing globally and the OOM killer was dealing with a zonelist that 
wouldn't necessarily have alleviated OOM conditions in other zones.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-03 13:06     ` Andrea Arcangeli
@ 2008-01-03 18:54       ` David Rientjes
  0 siblings, 0 replies; 40+ messages in thread
From: David Rientjes @ 2008-01-03 18:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> > So I think we're moving in a direction of requiring OOM killer timeouts 
> > and the only plausible way to do that is on a per-task level.  It would 
> > require another unsigned long addition to struct task_struct but would 
> > completely fix OOM killer deadlocks.
> 
> Yes. That can be added incrementally in a later patch, it's a new
> logic. In the meantime the deadlock is gone.
> 

In the "meantime" assumes that this patch is added before a jiffies 
count is added to struct task_struct to detect OOM killer timeouts.  
Since per-task OOM killer timeouts will obsolete this change, it doesn't 
seem beneficial to add.

> In practice there's a sort of timeout already, the oom-killing task
> will schedule_timeout(1) with the zone-oom-lock hold, and all other
> tasks trying to free memory (except the TIF_MEMDIE one) will also
> schedule_timeout(1) inside the VM code. That tends to prevent spurious
> kills already but it's far from a guarantee.
> 

I agree, it's not a guarantee because of tasks that can get stuck in D 
state and never exit after repeatedly being OOM killed.  OOM killer 
timeouts would solve that problem and would be able to scale the offending 
task back by removing access to memory reserves and reducing its 
timeslice.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03 18:47       ` David Rientjes
@ 2008-01-03 19:54         ` Andrea Arcangeli
  2008-01-03 20:49           ` David Rientjes
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-03 19:54 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Andrew Morton

On Thu, Jan 03, 2008 at 10:47:33AM -0800, David Rientjes wrote:
> Traditionally we've only allowed one thread in the entire system to have 
> TIF_MEMDIE at a time because as you give access to memory reserves to more 
> threads it becomes less of a help to exiting tasks.  So by OOM killing a 
> sibling or parent you could be taking away more memory from the exiting 
> task; hopefully it won't be noticeable and the sibling or parent will 
> quickly exit.

In theory no memory allocation should be required in do_exit.... in
practice sometime it can happen, but the PF_MEMALLOC pool is available
and can be emptied way before the first task has been killed, and the
potential eaters of the PF_MEMALLOC pool are much heavier users than
the do_exit path, so I doubt worrying about the memory reserves by the
time TIF_MEMDIE has been set is a valid concern.

> Perhaps instead of killing additional tasks, we should only make the 
> exemption if a TIF_MEMDIE task is TASK_UNINTERRUPTIBLE during the 
> select_bad_process() scan.  I haven't personally witnessed any blocking in 
> the exit path of an OOM killed task that doesn't leave it in D state and 
> prevents it from dying.

Initially I did this. But there are places where the task is
INTERRUPTIBLE too (like the ones I introduced). I felt it a bit weak.

> The 60-second (or whatever time limit) timeout would almost certainly 
> always target tasks stuck in D state.  Those tasks will probably never 

yes.

> exit if the timeout expires no matter how many times it is OOM killed.  So 

yes.

> the best alternative is to then take TIF_MEMDIE away from that task, 
> reduce its timeslice, and never select it again for OOM kill.

The TIF_MEMDIE undoing isn't a big deal. Sigkilling undoing is more
interesting.

> If you agree with me that an addition to struct task_struct to keep the 
> jiffies of the time it received TIF_MEMDIE is beneficial then it will 
> obsolete this patch.

I tried to prioritize and reduce and simplify the amount of stuff to
push to the minimum to be stable, but certainly I'd like to take the
more complex approach too, yet I'd keep it at the end to keep the
priority high on preventing the crash with small changes. I was being
more complex originally with a global timeout, still simpler than your
per-task timeout, and yet it wasn't merged as style changes
to such code bitrotten the patchset I guess.

> The zone-oom-lock wasn't intended to necessarily prevent subsequent OOM 
> kills of tasks, it was intended to serialize the OOM killer so that 
> multiple entrants will not be killing additional tasks when one would have 
> sufficed.

I know. I introduced the semaphore myself for this reason. But I also
had additional feedback coming from do_exit which is missing now...

> It was made on a per-zone level instead of a global level, as your 
> approach did, to support cpusets and memory policy OOM killings.  With a 
> global approach these OOM kills would have taken longer because you were 
> serializing globally and the OOM killer was dealing with a zonelist that 
> wouldn't necessarily have alleviated OOM conditions in other zones.

I know, scaling oom killing in parallel in numa is nicer but in
practice oom is rare and should never happen... so my global approach
wasn't that different ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03 19:54         ` Andrea Arcangeli
@ 2008-01-03 20:49           ` David Rientjes
  0 siblings, 0 replies; 40+ messages in thread
From: David Rientjes @ 2008-01-03 20:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> In theory no memory allocation should be required in do_exit.... in
> practice sometime it can happen, but the PF_MEMALLOC pool is available
> and can be emptied way before the first task has been killed, and the
> potential eaters of the PF_MEMALLOC pool are much heavier users than
> the do_exit path, so I doubt worrying about the memory reserves by the
> time TIF_MEMDIE has been set is a valid concern.
> 

Ok.

> > the best alternative is to then take TIF_MEMDIE away from that task, 
> > reduce its timeslice, and never select it again for OOM kill.
> 
> The TIF_MEMDIE undoing isn't a big deal. Sigkilling undoing is more
> interesting.
> 

Well, that doesn't matter either if the task is stuck in D state forever.  
I was thinking that reducing the timeslice to 1 would be beneficial, 
however, for the remainder of the system's uptime since the task will have 
received the HZ timeslice when killed by the OOM killer.

> I tried to prioritize and reduce and simplify the amount of stuff to
> push to the minimum to be stable, but certainly I'd like to take the
> more complex approach too, yet I'd keep it at the end to keep the
> priority high on preventing the crash with small changes. I was being
> more complex originally with a global timeout, still simpler than your
> per-task timeout, and yet it wasn't merged as style changes
> to such code bitrotten the patchset I guess.
> 

Ok.

The global timeout would require the jiffies to be stored when the SIGKILL 
is issued and cleared in the exit path with a test_tsk_thread_flag(p, 
TIF_MEMDIE) check.  Unfortunately that doesn't work because, as you said, 
it is possible for more than one thread to have TIF_MEMDIE.  So there 
would be no way to catch tasks stuck in D state that have been OOM killed 
to be exempted from making the entire OOM killer a no-op.

> > It was made on a per-zone level instead of a global level, as your 
> > approach did, to support cpusets and memory policy OOM killings.  With a 
> > global approach these OOM kills would have taken longer because you were 
> > serializing globally and the OOM killer was dealing with a zonelist that 
> > wouldn't necessarily have alleviated OOM conditions in other zones.
> 
> I know, scaling oom killing in parallel in numa is nicer but in
> practice oom is rare and should never happen... so my global approach
> wasn't that different ;)
> 

It's becoming much more popular since the memory controller work that is 
based on cgroups uses OOM killing as a mechanism, in part, for enforcing 
its policy.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01 of 11] limit shrink zone scanning
  2008-01-03  2:09 ` [PATCH 01 of 11] limit shrink zone scanning Andrea Arcangeli
@ 2008-01-07 19:11   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> Assume two tasks adds to nr_scan_*active at the same time (first line of the
> old buggy code), they'll effectively double their scan rate, for no good
> reason. What can happen is that instead of scanning nr_entries each, they'll
> scan nr_entries*2 each. The more CPUs the bigger the race and the higher the
> multiplication effect and the harder it will be to detect oom. This puts a cap
> on the amount of work that it makes sense to do in case the race triggers.

Looks like a workaround. Would it be cleaner to make the scan 
counters local variables and only them when the scan is complete?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02 of 11] avoid oom deadlock in nfs_create_request
  2008-01-03  2:09 ` [PATCH 02 of 11] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2008-01-07 19:13   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03 of 11] prevent oom deadlocks during read/write operations
  2008-01-03  2:09 ` [PATCH 03 of 11] prevent oom deadlocks during read/write operations Andrea Arcangeli
@ 2008-01-07 19:15   ` Christoph Lameter
  2008-01-07 19:26     ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:15 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

This means that killing a process with SIGKILL from user land may lead to 
OOM handling being triggered in the VM?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04 of 11] avoid selecting already killed tasks
  2008-01-03  2:09 ` [PATCH 04 of 11] avoid selecting already killed tasks Andrea Arcangeli
  2008-01-03  9:40   ` David Rientjes
@ 2008-01-07 19:17   ` Christoph Lameter
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03 of 11] prevent oom deadlocks during read/write operations
  2008-01-07 19:15   ` Christoph Lameter
@ 2008-01-07 19:26     ` Andrea Arcangeli
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-07 19:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, David Rientjes

On Mon, Jan 07, 2008 at 11:15:49AM -0800, Christoph Lameter wrote:
> This means that killing a process with SIGKILL from user land may lead to 
> OOM handling being triggered in the VM?

Well, Andrew added the status=-ENOMEM in his version, but userland
should never get to see the status. So this should only have the
effect of being more reactive to SIGKILL. Alternatively TIF_MEMDIE
could be checked, but checking sigkill sounded nicer there.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 05 of 11] reduce the probability of an OOM livelock
  2008-01-03  2:09 ` [PATCH 05 of 11] reduce the probability of an OOM livelock Andrea Arcangeli
@ 2008-01-07 19:32   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> @@ -1337,7 +1337,6 @@ static unsigned long balance_pgdat(pg_da
>  
>  loop_again:
>  	total_scanned = 0;
> -	nr_reclaimed = 0;
>  	sc.may_writepage = !laptop_mode;
>  	count_vm_event(PAGEOUTRUN);
>  
> @@ -1347,6 +1346,7 @@ loop_again:
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
>  		unsigned long lru_pages = 0;
> +		unsigned long nr_reclaimed;
>  
>  		/* The swap token gets in the way of swapout... */
>  		if (!priority)
> @@ -1393,6 +1393,7 @@ loop_again:
>  		 * pages behind kswapd's direction of progress, which would
>  		 * cause too much scanning of the lower zones.
>  		 */
> +		nr_reclaimed = 0;
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
>  			int nr_slab;

nr_reclaimed is now only the number reclaimed at a certain priority. Its 
effectively lower. It seems that the only test that can be affected by 
this is:

  if (nr_reclaimed >= SWAP_CLUSTER_MAX)
                        break;

But reducing nr_reclaim makes it more likely to continue the 
loop? I thought you wanted to stop scanning?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06 of 11] balance_pgdat doesn't return the number of pages freed
  2008-01-03  2:09 ` [PATCH 06 of 11] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
@ 2008-01-07 19:33   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:33 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

Could actually be indepedent of the rest of the patchset.

Reviewed-by: Christoph Lameter <clameter@sgi.com>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09 of 11] oom select should only take rss into account
  2008-01-03  2:09 ` [PATCH 09 of 11] oom select should only take rss into account Andrea Arcangeli
@ 2008-01-07 19:35   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10 of 11] limit reclaim if enough pages have been freed
  2008-01-03  2:09 ` [PATCH 10 of 11] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2008-01-07 19:37   ` Christoph Lameter
  2008-01-08  7:28     ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> No need to wipe out an huge chunk of the cache.

Wiping out a larger chunk of the cache avoids triggering reclaim too 
frequently.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-03  2:09 ` [PATCH 11 of 11] not-wait-memdie Andrea Arcangeli
  2008-01-03  9:55   ` David Rientjes
@ 2008-01-07 19:43   ` Christoph Lameter
  2008-01-08  1:57     ` David Rientjes
  2008-01-08  7:31     ` Andrea Arcangeli
  1 sibling, 2 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton, David Rientjes

On Thu, 3 Jan 2008, Andrea Arcangeli wrote:

> +		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
> +			/*
> +			 * Hopefully we already waited long enough,
> +			 * or exit_mm already run, but we must try to kill
> +			 * another task to avoid deadlocking.
> +			 */
> +			continue;
> +		}

If all tasks are marked TIF_MEMDIE then we just scan through them return 
NULL and


>  		/* Found nothing?!?! Either we hang forever, or we panic. */
> -		if (!p) {
> +		if (unlikely(!p)) {
>  			read_unlock(&tasklist_lock);
>  			panic("Out of memory and no killable processes...\n");

panic.

Should we not wait awhile before panicing? The processes may need some 
time to terminate.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-07 19:43   ` Christoph Lameter
@ 2008-01-08  1:57     ` David Rientjes
  2008-01-08  3:25       ` Nick Piggin
  2008-01-08  7:37       ` Andrea Arcangeli
  2008-01-08  7:31     ` Andrea Arcangeli
  1 sibling, 2 replies; 40+ messages in thread
From: David Rientjes @ 2008-01-08  1:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrea Arcangeli, linux-mm, Andrew Morton

On Mon, 7 Jan 2008, Christoph Lameter wrote:

> > +		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
> > +			/*
> > +			 * Hopefully we already waited long enough,
> > +			 * or exit_mm already run, but we must try to kill
> > +			 * another task to avoid deadlocking.
> > +			 */
> > +			continue;
> > +		}
> 
> If all tasks are marked TIF_MEMDIE then we just scan through them return 
> NULL and
> 

That's the problem that I've been mentioning: giving several tasks access 
to memory reserves just isn't right.  It should be given to a single 
OOM-killed task that will alleviate the OOM condition for the task that 
called out_of_memory().  For an entire system it would still be possible 
for several tasks to be TIF_MEMDIE (in the case of cpuset, memory 
controller, or mempolicy OOM killing) but never more than one task that 
shares a common zone.

> >  		/* Found nothing?!?! Either we hang forever, or we panic. */
> > -		if (!p) {
> > +		if (unlikely(!p)) {
> >  			read_unlock(&tasklist_lock);
> >  			panic("Out of memory and no killable processes...\n");
> 
> panic.
> 
> Should we not wait awhile before panicing? The processes may need some 
> time to terminate.
> 

That's only possible with my proposal of adding

	unsigned long oom_kill_jiffies;

to struct task_struct.  We can't get away with a system-wide jiffies 
variable, nor can we get away with per-cgroup, per-cpuset, or 
per-mempolicy variable.  The only way to clear such a variable is in the 
exit path (by checking test_thread_flag(tsk, TIF_MEMDIE) in do_exit()) and 
fails miserably if there are simultaneous but zone-disjoint OOMs 
occurring.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-08  1:57     ` David Rientjes
@ 2008-01-08  3:25       ` Nick Piggin
  2008-01-08  3:37         ` David Rientjes
  2008-01-08  7:45         ` Andrea Arcangeli
  2008-01-08  7:37       ` Andrea Arcangeli
  1 sibling, 2 replies; 40+ messages in thread
From: Nick Piggin @ 2008-01-08  3:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Andrew Morton

On Tuesday 08 January 2008 12:57, David Rientjes wrote:
> On Mon, 7 Jan 2008, Christoph Lameter wrote:
> > > +		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
> > > +			/*
> > > +			 * Hopefully we already waited long enough,
> > > +			 * or exit_mm already run, but we must try to kill
> > > +			 * another task to avoid deadlocking.
> > > +			 */
> > > +			continue;
> > > +		}
> >
> > If all tasks are marked TIF_MEMDIE then we just scan through them return
> > NULL and
>
> That's the problem that I've been mentioning: giving several tasks access
> to memory reserves just isn't right.

We already do that today in the case of regular page reclaim.

The problem is the global reserve. Once you have a kernel that doesn't
need this handwavy global reserve for forward progress, a lot of little
problems go away.


> It should be given to a single 
> OOM-killed task that will alleviate the OOM condition for the task that
> called out_of_memory().

It should be, but that task you OOM may be blocking on another one that
is waiting for memory, for example.

In practice, I think a task will not need a great deal of memory in order
to finish what it is doing and exit; but it will be more likely to be in
some oom deadlock. So neither solution is perfect, but I think this patch
will solve more cases than it introduces.


> For an entire system it would still be possible 
> for several tasks to be TIF_MEMDIE (in the case of cpuset, memory
> controller, or mempolicy OOM killing) but never more than one task that
> shares a common zone.
>
> > >  		/* Found nothing?!?! Either we hang forever, or we panic. */
> > > -		if (!p) {
> > > +		if (unlikely(!p)) {
> > >  			read_unlock(&tasklist_lock);
> > >  			panic("Out of memory and no killable processes...\n");
> >
> > panic.
> >
> > Should we not wait awhile before panicing? The processes may need some
> > time to terminate.
>
> That's only possible with my proposal of adding
>
> 	unsigned long oom_kill_jiffies;
>
> to struct task_struct.  We can't get away with a system-wide jiffies
> variable, nor can we get away with per-cgroup, per-cpuset, or
> per-mempolicy variable.  The only way to clear such a variable is in the
> exit path (by checking test_thread_flag(tsk, TIF_MEMDIE) in do_exit()) and
> fails miserably if there are simultaneous but zone-disjoint OOMs
> occurring.

Why not just have a global frequency limit on OOM events. Then the panic
has this delay factored in...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-08  3:25       ` Nick Piggin
@ 2008-01-08  3:37         ` David Rientjes
  2008-01-08  7:42           ` Nick Piggin
  2008-01-08  7:45         ` Andrea Arcangeli
  1 sibling, 1 reply; 40+ messages in thread
From: David Rientjes @ 2008-01-08  3:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Andrew Morton

On Tue, 8 Jan 2008, Nick Piggin wrote:

> The problem is the global reserve. Once you have a kernel that doesn't
> need this handwavy global reserve for forward progress, a lot of little
> problems go away.
> 

I'm specifically talking about TIF_MEMDIE here which gives access to that 
global reserve.  In OOM situations there is no easy way to guarantee that 
a task will have enough memory to exit, but that is exactly what is needed 
to alleviate the condition.  Additionally, it is not guaranteed that a 
task that has been OOM killed and given access to the global reserve will 
exit after it has exhausted that reserve in its entirety.  That's when the 
system deadlocks.

So giving access to the global reserve to multiple tasks that share memory 
in at least one of their zones for simultaneous OOM killings is not a 
complete solution.  There should be a timeout on tasks when they are OOM 
killed; if they cannot exit for the duration of that period, they lose 
access to the reserves and only then is another task selected.

> > It should be given to a single 
> > OOM-killed task that will alleviate the OOM condition for the task that
> > called out_of_memory().
> 
> It should be, but that task you OOM may be blocking on another one that
> is waiting for memory, for example.
> 

And after the timeout that I'm proposing it, or another suitable 
candidate, will be killed instead.  The dependencies are beyond the scope 
of the OOM killer badness scoring but without giving tasks a short but 
reasonable period to exit and then opting to kill another task there will 
always exist the potential for deadlock.

> > That's only possible with my proposal of adding
> >
> > 	unsigned long oom_kill_jiffies;
> >
> > to struct task_struct.  We can't get away with a system-wide jiffies
> > variable, nor can we get away with per-cgroup, per-cpuset, or
> > per-mempolicy variable.  The only way to clear such a variable is in the
> > exit path (by checking test_thread_flag(tsk, TIF_MEMDIE) in do_exit()) and
> > fails miserably if there are simultaneous but zone-disjoint OOMs
> > occurring.
> 
> Why not just have a global frequency limit on OOM events. Then the panic
> has this delay factored in...
> 

Because OOM killing is going to become more and more frequent with the 
introduction of the memory controller which uses it as a mechanism to 
enforce its policy.  And a global frequency limit does not work well for 
parallel cpuset, mempolicy, or memory controller OOM events.  That is why 
it is currently serialized by the triggering task's zonelist and not 
globally.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10 of 11] limit reclaim if enough pages have been freed
  2008-01-07 19:37   ` Christoph Lameter
@ 2008-01-08  7:28     ` Andrea Arcangeli
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, David Rientjes

On Mon, Jan 07, 2008 at 11:37:19AM -0800, Christoph Lameter wrote:
> On Thu, 3 Jan 2008, Andrea Arcangeli wrote:
> 
> > No need to wipe out an huge chunk of the cache.
> 
> Wiping out a larger chunk of the cache avoids triggering reclaim too 
> frequently.

The idea is that if you want to wipe a larger chunk to batch reclaim
more aggressively you can add a new param in addition to
swap-cluster-max. But wiping such a chunk as large as it is now sounds
bad for latency reasons (think also -rt... ok any -rt guarantee is
gone the moment you enter the VM, but still this spot is easy to fix
and benefits all my kernels with PREEMPT=n too).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-07 19:43   ` Christoph Lameter
  2008-01-08  1:57     ` David Rientjes
@ 2008-01-08  7:31     ` Andrea Arcangeli
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, David Rientjes

On Mon, Jan 07, 2008 at 11:43:07AM -0800, Christoph Lameter wrote:
> On Thu, 3 Jan 2008, Andrea Arcangeli wrote:
> 
> > +		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
> > +			/*
> > +			 * Hopefully we already waited long enough,
> > +			 * or exit_mm already run, but we must try to kill
> > +			 * another task to avoid deadlocking.
> > +			 */
> > +			continue;
> > +		}
> 
> If all tasks are marked TIF_MEMDIE then we just scan through them return 
> NULL and
> 
> 
> >  		/* Found nothing?!?! Either we hang forever, or we panic. */
> > -		if (!p) {
> > +		if (unlikely(!p)) {
> >  			read_unlock(&tasklist_lock);
> >  			panic("Out of memory and no killable processes...\n");
> 
> panic.
> 
> Should we not wait awhile before panicing? The processes may need some 
> time to terminate.

I've a new patchset that would wait 60 sec for every new set
TIF_MEMDIE before paniking. that will fix this. Thanks.

Problem is with the new patchset with the memdie_jiffies, I run into
new deadlock with my testcase so I was trying to fix those new issues
before submission. For whatever reason I could never reproduce any
problem with the patchset I sent to linux-mm before introducing the
memdie_iffies. However this week I've other urgent work to do too not
in the oom area... so we'll see what I can do. If I can't fix the new
deadlocks within a few days I'll submit a new patchset anyway.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-08  1:57     ` David Rientjes
  2008-01-08  3:25       ` Nick Piggin
@ 2008-01-08  7:37       ` Andrea Arcangeli
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:37 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-mm, Andrew Morton

On Mon, Jan 07, 2008 at 05:57:41PM -0800, David Rientjes wrote:
> That's only possible with my proposal of adding
> 
> 	unsigned long oom_kill_jiffies;
> 
> to struct task_struct.  We can't get away with a system-wide jiffies 

I already added it.

> variable, nor can we get away with per-cgroup, per-cpuset, or 
> per-mempolicy variable.  The only way to clear such a variable is in the 
> exit path (by checking test_thread_flag(tsk, TIF_MEMDIE) in do_exit()) and 
> fails miserably if there are simultaneous but zone-disjoint OOMs 
> occurring.

I don't see much issues with zone-disjoints oom with my current
patchset. The trouble is a new deadlock that I'm reproducing now, I
submit you privately a preview of the memdie_jiffies, did you see any
problem in my implementation? I guess I'll resubmit to linux-mm
too. The new deadlock I run into after adding memdie_jiffies is likely
unrelated to the memdie_jiffies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-08  3:37         ` David Rientjes
@ 2008-01-08  7:42           ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2008-01-08  7:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Andrew Morton

On Tuesday 08 January 2008 14:37, David Rientjes wrote:
> On Tue, 8 Jan 2008, Nick Piggin wrote:
> > The problem is the global reserve. Once you have a kernel that doesn't
> > need this handwavy global reserve for forward progress, a lot of little
> > problems go away.
>
> I'm specifically talking about TIF_MEMDIE here which gives access to that
> global reserve.

And I'm specifically talking about PF_MEMALLOC, which does the same.

> In OOM situations there is no easy way to guarantee that 
> a task will have enough memory to exit, but that is exactly what is needed
> to alleviate the condition.  Additionally, it is not guaranteed that a
> task that has been OOM killed and given access to the global reserve will
> exit after it has exhausted that reserve in its entirety.  That's when the
> system deadlocks.

I know all that ;) Your second point is the reason to have more than 1
MEMDIE process...


> So giving access to the global reserve to multiple tasks that share memory
> in at least one of their zones for simultaneous OOM killings is not a
> complete solution.  There should be a timeout on tasks when they are OOM
> killed; if they cannot exit for the duration of that period, they lose
> access to the reserves and only then is another task selected.

Hmm, OK I didn't realise you'd proposed that as an alternative. Maybe.
I don't know if the complexity would be worthwhile, given that there is
no sort of reentrancy limit on the global reserve pool anyway.


> > > That's only possible with my proposal of adding
> > >
> > > 	unsigned long oom_kill_jiffies;
> > >
> > > to struct task_struct.  We can't get away with a system-wide jiffies
> > > variable, nor can we get away with per-cgroup, per-cpuset, or
> > > per-mempolicy variable.  The only way to clear such a variable is in
> > > the exit path (by checking test_thread_flag(tsk, TIF_MEMDIE) in
> > > do_exit()) and fails miserably if there are simultaneous but
> > > zone-disjoint OOMs occurring.
> >
> > Why not just have a global frequency limit on OOM events. Then the panic
> > has this delay factored in...
>
> Because OOM killing is going to become more and more frequent with the
> introduction of the memory controller which uses it as a mechanism to
> enforce its policy.  And a global frequency limit does not work well for
> parallel cpuset, mempolicy, or memory controller OOM events.  That is why
> it is currently serialized by the triggering task's zonelist and not
> globally.

I don't think that's a very good reason for the complexity. If your
system is OOM-throughput-limited, then something's very wrong with
your wokload management. (and I don't buy the DoS security argument
either because the memory controller doesn't provide security last
time I looked).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11 of 11] not-wait-memdie
  2008-01-08  3:25       ` Nick Piggin
  2008-01-08  3:37         ` David Rientjes
@ 2008-01-08  7:45         ` Andrea Arcangeli
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Rientjes, Christoph Lameter, Andrea Arcangeli, linux-mm,
	Andrew Morton

On Tue, Jan 08, 2008 at 02:25:31PM +1100, Nick Piggin wrote:
> We already do that today in the case of regular page reclaim.

exactly. And we have to set TIF_MEMDIE on more than one task in case
we have a locking dependency like the ones I can trivially reproduce.

> The problem is the global reserve. Once you have a kernel that doesn't
> need this handwavy global reserve for forward progress, a lot of little
> problems go away.

Yep.

> It should be, but that task you OOM may be blocking on another one that
> is waiting for memory, for example.

Exactly.

> In practice, I think a task will not need a great deal of memory in order
> to finish what it is doing and exit; but it will be more likely to be in
> some oom deadlock. So neither solution is perfect, but I think this patch
> will solve more cases than it introduces.

Yes. The memory reserve being accessed by more than one TIF_MEMDIE
task is the least of the problems. Also consider the first task will
normally not even try to use the memory reserve at all, because if we
set TIF_MEMDIE on a second task, it's because the first task was
normally totally stuck on a lock and sleeping in D state the whole time.

> Why not just have a global frequency limit on OOM events. Then the panic
> has this delay factored in...

My current pathset uses 1min as max frequency.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2008-01-08  7:45 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-03  2:09 [PATCH 00 of 11] oom deadlock fixes Andrea Arcangeli
2008-01-03  2:09 ` [PATCH 01 of 11] limit shrink zone scanning Andrea Arcangeli
2008-01-07 19:11   ` Christoph Lameter
2008-01-03  2:09 ` [PATCH 02 of 11] avoid oom deadlock in nfs_create_request Andrea Arcangeli
2008-01-07 19:13   ` Christoph Lameter
2008-01-03  2:09 ` [PATCH 03 of 11] prevent oom deadlocks during read/write operations Andrea Arcangeli
2008-01-07 19:15   ` Christoph Lameter
2008-01-07 19:26     ` Andrea Arcangeli
2008-01-03  2:09 ` [PATCH 04 of 11] avoid selecting already killed tasks Andrea Arcangeli
2008-01-03  9:40   ` David Rientjes
2008-01-03 13:41     ` Andrea Arcangeli
2008-01-03 18:47       ` David Rientjes
2008-01-03 19:54         ` Andrea Arcangeli
2008-01-03 20:49           ` David Rientjes
2008-01-07 19:17   ` Christoph Lameter
2008-01-03  2:09 ` [PATCH 05 of 11] reduce the probability of an OOM livelock Andrea Arcangeli
2008-01-07 19:32   ` Christoph Lameter
2008-01-03  2:09 ` [PATCH 06 of 11] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
2008-01-07 19:33   ` Christoph Lameter
2008-01-03  2:09 ` [PATCH 07 of 11] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
2008-01-03  9:52   ` David Rientjes
2008-01-03 13:29     ` Andrea Arcangeli
2008-01-03  2:09 ` [PATCH 08 of 11] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
2008-01-03  2:09 ` [PATCH 09 of 11] oom select should only take rss into account Andrea Arcangeli
2008-01-07 19:35   ` Christoph Lameter
2008-01-03  2:09 ` [PATCH 10 of 11] limit reclaim if enough pages have been freed Andrea Arcangeli
2008-01-07 19:37   ` Christoph Lameter
2008-01-08  7:28     ` Andrea Arcangeli
2008-01-03  2:09 ` [PATCH 11 of 11] not-wait-memdie Andrea Arcangeli
2008-01-03  9:55   ` David Rientjes
2008-01-03 13:06     ` Andrea Arcangeli
2008-01-03 18:54       ` David Rientjes
2008-01-07 19:43   ` Christoph Lameter
2008-01-08  1:57     ` David Rientjes
2008-01-08  3:25       ` Nick Piggin
2008-01-08  3:37         ` David Rientjes
2008-01-08  7:42           ` Nick Piggin
2008-01-08  7:45         ` Andrea Arcangeli
2008-01-08  7:37       ` Andrea Arcangeli
2008-01-08  7:31     ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox