[PATCH 00 of 13] oom deadlock fixes # try 2

linux-mm.kvack.org archive mirror

* [PATCH 00 of 13] oom deadlock fixes # try 2
@ 2008-01-08  7:50 Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 01 of 13] limit shrink zone scanning Andrea Arcangeli
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

This introduces memdie_jiffies and MEMDIE_DELAY, plus some minor
improvements that probably aren't strictly necessary (but I found tasks
looping in fork() allocating pagetables with GFP_REPEAT, and lots of tasks in
congestion_wait, so I thought I'd improve those two bits too). I can still
reproduce one deadlock under a certain condition with this patchset, while no
deadlock was happening with the previous one, before memdie_jiffies, for
whatever reason. I was trying to fix that last deadlock before submission, but
because of the discussion on linux-mm about what I already have implemented
and working fine, I'm submitting this right now (the new deadlock is likely
unrelated to these changes). I'm wondering if perhaps it's related to having
reintroduced the PF_EXITING check, but in theory it shouldn't be, because the
PF_EXITING check should go off after 60sec, when we start skipping over the
TIF_MEMDIE tasks.

I wrote the last two patches after checking stack traces while debugging the
new deadlock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 01 of 13] limit shrink zone scanning
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 02 of 13] avoid oom deadlock in nfs_create_request Andrea Arcangeli
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199469588 -3600
# Node ID 31ccca6f0b3ee1b340b9d32c0231eb1f957ee1ad
# Parent  e28e1be3fae5183e3e36e32e3feb9a59ec59c825
limit shrink zone scanning

Assume two tasks add to nr_scan_*active at the same time (first line of the
old buggy code): they'll effectively double their scan rate, for no good
reason. Instead of scanning nr_entries each, they'll scan nr_entries*2 each.
The more CPUs, the bigger the race, the higher the multiplication effect, and
the harder it will be to detect oom. This puts a cap on the amount of work
that it makes sense to do in case the race triggers.
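
As a rough illustration of the race described above, the following user-space
sketch models the shared scan budget with a plain struct. The names
zone_model, add_scan_share, scan_target_uncapped and scan_target_capped are
hypothetical stand-ins, not kernel code; only the min() cap mirrors what the
patch does.

```c
/* Hypothetical model of the per-zone counters used by shrink_zone(). */
struct zone_model {
	unsigned long nr_scan_active;	/* accumulated scan budget */
	unsigned long nr_active_pages;	/* pages currently on the active list */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* One task adds its share to the shared budget (the racy first line). */
static void add_scan_share(struct zone_model *z, int priority)
{
	z->nr_scan_active += (z->nr_active_pages >> priority) + 1;
}

/* Old behaviour: scan whatever has accumulated, even if doubled by a race. */
static unsigned long scan_target_uncapped(const struct zone_model *z)
{
	return z->nr_scan_active;
}

/* Patched behaviour: never scan more entries than the list holds. */
static unsigned long scan_target_capped(const struct zone_model *z)
{
	return min_ul(z->nr_scan_active, z->nr_active_pages);
}
```

With 1000 pages on the list and two tasks adding their share at priority 0
before either reads the budget, the uncapped target is 2002 while the capped
target stays at 1000.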

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1114,7 +1114,7 @@ static unsigned long shrink_zone(int pri
 	 */
 	zone->nr_scan_active +=
 		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
+	nr_active = min(zone->nr_scan_active, zone_page_state(zone, NR_ACTIVE));
 	if (nr_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;
 	else
@@ -1122,7 +1122,7 @@ static unsigned long shrink_zone(int pri
 
 	zone->nr_scan_inactive +=
 		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
+	nr_inactive = min(zone->nr_scan_inactive, zone_page_state(zone, NR_INACTIVE));
 	if (nr_inactive >= sc->swap_cluster_max)
 		zone->nr_scan_inactive = 0;
 	else


* [PATCH 02 of 13] avoid oom deadlock in nfs_create_request
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 01 of 13] limit shrink zone scanning Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 03 of 13] prevent oom deadlocks during read/write operations Andrea Arcangeli
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199469588 -3600
# Node ID ddd02ad798f6902fc561843c60f1189a44fdb439
# Parent  31ccca6f0b3ee1b340b9d32c0231eb1f957ee1ad
avoid oom deadlock in nfs_create_request

When SIGKILL is pending after the oom killer has set TIF_MEMDIE, the
task must go away or the VM will malfunction.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte
 	struct nfs_server *server = NFS_SERVER(inode);
 	struct nfs_page		*req;
 
-	for (;;) {
-		/* try to allocate the request struct */
-		req = nfs_page_alloc();
-		if (req != NULL)
-			break;
+	/* try to allocate the request struct */
+	req = nfs_page_alloc();
+	if (unlikely(!req)) {
+		/*
+		 * -ENOMEM will be returned only when TIF_MEMDIE is set
+		 * so userland shouldn't risk to get confused by a new
+		 * unhandled ENOMEM errno.
+		 */
+		WARN_ON(!test_thread_flag(TIF_MEMDIE));
+		return ERR_PTR(-ENOMEM);
+	}
 
-		if (signalled() && (server->flags & NFS_MOUNT_INTR))
-			return ERR_PTR(-ERESTARTSYS);
-		yield();
-	}
+	if (signalled() && (server->flags & NFS_MOUNT_INTR))
+		return ERR_PTR(-ERESTARTSYS);
 
 	/* Initialize the request struct. Initially, we assume a
 	 * long write-back delay. This will be adjusted in


* [PATCH 03 of 13] prevent oom deadlocks during read/write operations
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 01 of 13] limit shrink zone scanning Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 02 of 13] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 04 of 13] avoid selecting already killed tasks Andrea Arcangeli
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199469588 -3600
# Node ID 4091a7ef36c80c3d2fa0d60a7b8bd885da68154d
# Parent  ddd02ad798f6902fc561843c60f1189a44fdb439
prevent oom deadlocks during read/write operations

We need to react to SIGKILL during read/write with huge buffers, or it
becomes too easy to prevent a SIGKILLed task from running do_exit
promptly after it has been selected for oom-killage.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -927,6 +927,16 @@ page_ok:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
+			page_cache_release(page);
+			goto out;
+		}
+
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in
+			 * presence of sigkill and lots of ram/swap
+			 * (think during OOM).
+			 */
 			page_cache_release(page);
 			goto out;
 		}
@@ -2063,6 +2073,16 @@ static ssize_t generic_perform_write_2co
 			break;
 		}
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in
+			 * presence of sigkill and lots of ram/swap
+			 * (think during OOM).
+			 */
+			status = -ENOMEM;
+			break;
+		}
+
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
@@ -2230,6 +2250,16 @@ again:
 		 */
 		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
 			status = -EFAULT;
+			break;
+		}
+
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			status = -ENOMEM;
+			/*
+			 * Must not hang almost forever in D state in
+			 * presence of sigkill and lots of ram/swap
+			 * (think during OOM).
+			 */
 			break;
 		}
 


* [PATCH 04 of 13] avoid selecting already killed tasks
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 03 of 13] prevent oom deadlocks during read/write operations Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 05 of 13] reduce the probability of an OOM livelock Andrea Arcangeli
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470015 -3600
# Node ID e08fdb8dad51268d7a786625fc54c65f277f736b
# Parent  4091a7ef36c80c3d2fa0d60a7b8bd885da68154d
avoid selecting already killed tasks

If the killed task doesn't go away because it's waiting on some other
task that needs to allocate memory in order to release the i_sem or
some other lock, we must fall back to killing some other task so that
the originally selected and already oom-killed task can exit. But the
logic that kills the children first would deadlock if the already
oom-killed task was actually the first child of the newly oom-killed
task.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1178,6 +1178,7 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	unsigned long memdie_jiffies;
 };
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -30,6 +30,8 @@ int sysctl_oom_kill_allocating_task;
 int sysctl_oom_kill_allocating_task;
 static DEFINE_SPINLOCK(zone_scan_mutex);
 /* #define DEBUG */
+
+#define MEMDIE_DELAY (60*HZ)
 
 /**
  * badness - calculate a numeric value for how bad this task has been
@@ -287,7 +289,8 @@ static void __oom_kill_task(struct task_
 	 * exit() and clear out its resources quickly...
 	 */
 	p->time_slice = HZ;
-	set_tsk_thread_flag(p, TIF_MEMDIE);
+	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE))
+		p->memdie_jiffies = jiffies;
 
 	force_sig(SIGKILL, p);
 }
@@ -362,6 +365,13 @@ static int oom_kill_process(struct task_
 	/* Try to kill a child first */
 	list_for_each_entry(c, &p->children, sibling) {
 		if (c->mm == p->mm)
+			continue;
+		/*
+		 * We cannot select tasks with TIF_MEMDIE already set
+		 * or we'll hard deadlock.
+		 */
+		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE) &&
+			     time_before(c->memdie_jiffies + MEMDIE_DELAY, jiffies)))
 			continue;
 		if (!oom_kill_task(c))
 			return 0;


* [PATCH 05 of 13] reduce the probability of an OOM livelock
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 04 of 13] avoid selecting already killed tasks Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 06 of 13] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470021 -3600
# Node ID 351a3906181f5c0fe0137b6f066f725bd65673ba
# Parent  e08fdb8dad51268d7a786625fc54c65f277f736b
reduce the probability of an OOM livelock

There's no need to loop too many times over the lrus in order to
declare defeat and decide to kill a task. The more loops we do, the more
likely we'll run into a livelock with a page bouncing back and forth
between tasks. The maximum number of entries to check in a loop that
returns fewer than swap-cluster-max pages freed should be the size of
the list (or at most twice the size of the list if you want to be
really paranoid about the PG_referenced bit).

Our objective there is to know reliably when it's time to kill a task;
trying to free a few more pages at that already critical point is
worthless.

This seems to have the effect of reducing the "hang" time during oom
killing.
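
To make the difference concrete, here is a small user-space sketch that
contrasts counting reclaimed pages cumulatively across priority passes (the
old behaviour) with counting them per pass (what the diff below switches to).
The function names and the array of per-pass results are hypothetical; the
threshold plays the role of swap_cluster_max.

```c
#include <stddef.h>

/*
 * Old behaviour: reclaimed pages accumulate across passes, so many
 * weak passes can add up to an apparent success.
 * Returns the index of the first successful pass, or -1 if none.
 */
static int first_success_cumulative(const unsigned long *per_pass, size_t n,
				    unsigned long threshold)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++) {
		total += per_pass[i];
		if (total >= threshold)
			return (int)i;
	}
	return -1;
}

/*
 * Patched behaviour: each pass must reach the threshold on its own,
 * so defeat is declared sooner when every pass frees only a few pages.
 */
static int first_success_per_pass(const unsigned long *per_pass, size_t n,
				  unsigned long threshold)
{
	for (size_t i = 0; i < n; i++) {
		if (per_pass[i] >= threshold)
			return (int)i;
	}
	return -1;
}
```

With per-pass results of 10, 10, 10, 10 and a threshold of 32, the cumulative
count reports success on the fourth pass, while the per-pass count reports
failure, which is the signal that it is time to kill a task.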

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1211,7 +1211,6 @@ unsigned long try_to_free_pages(struct z
 	int priority;
 	int ret = 0;
 	unsigned long total_scanned = 0;
-	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
 	int i;
@@ -1237,15 +1236,17 @@ unsigned long try_to_free_pages(struct z
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		unsigned long nr_reclaimed;
+
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		nr_reclaimed = shrink_zones(priority, zones, &sc);
+		if (reclaim_state)
+			reclaim_state->reclaimed_slab = 0;
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
-		if (reclaim_state) {
+		if (reclaim_state)
 			nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
 		total_scanned += sc.nr_scanned;
 		if (nr_reclaimed >= sc.swap_cluster_max) {
 			ret = 1;
@@ -1320,7 +1321,6 @@ static unsigned long balance_pgdat(pg_da
 	int priority;
 	int i;
 	unsigned long total_scanned;
-	unsigned long nr_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -1337,7 +1337,6 @@ static unsigned long balance_pgdat(pg_da
 
 loop_again:
 	total_scanned = 0;
-	nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
@@ -1347,6 +1346,7 @@ loop_again:
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		unsigned long nr_reclaimed;
 
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
@@ -1393,6 +1393,7 @@ loop_again:
 		 * pages behind kswapd's direction of progress, which would
 		 * cause too much scanning of the lower zones.
 		 */
+		nr_reclaimed = 0;
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;


* [PATCH 06 of 13] balance_pgdat doesn't return the number of pages freed
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 05 of 13] reduce the probability of an OOM livelock Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 07 of 13] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470022 -3600
# Node ID dd5900d0aa4e5f1b81364346465be53db897246f
# Parent  351a3906181f5c0fe0137b6f066f725bd65673ba
balance_pgdat doesn't return the number of pages freed

nr_reclaimed would be the number of pages freed in the last pass.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1298,8 +1298,6 @@ out:
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *
- * Returns the number of pages which were actually freed.
- *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
  * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
@@ -1315,7 +1313,7 @@ out:
  * the page allocator fallback scheme to ensure that aging of pages is balanced
  * across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static void balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -1475,8 +1473,6 @@ out:
 
 		goto loop_again;
 	}
-
-	return nr_reclaimed;
 }
 
 /*


* [PATCH 07 of 13] don't depend on PF_EXITING tasks to go away
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 06 of 13] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 08 of 13] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470022 -3600
# Node ID ee9691f08d054949b7718cff94c4f132d97626de
# Parent  dd5900d0aa4e5f1b81364346465be53db897246f
don't depend on PF_EXITING tasks to go away

A PF_EXITING task doesn't have TIF_MEMDIE set, so it might get stuck in
memory allocations without access to the PF_MEMALLOC pool (that said,
ideally do_exit should not require memory allocations, especially not
before calling exit_mm). The same way we raise its privilege to
TIF_MEMDIE if it's the current task, we should do it even if it's not
the current task, to speed up oom killing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -233,16 +233,16 @@ static struct task_struct *select_bad_pr
 		 * This is in the process of releasing memory so wait for it
 		 * to finish before killing some other task by mistake.
 		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
+		 * We must however set TIF_MEMDIE on this task so we select it with
+		 * maximum points. This PF_EXITING task may be out of the scheduler
+		 * and zombie and it may have released all its memory already and
+		 * furthermore we want to give it access to all the memory reserves.
+		 *
+		 * If it's too late and this selected task can't release any memory
+		 * anymore the memdie_jiffies will timeout and fallback in killing
+		 * a new task later.
 		 */
 		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
 			chosen = p;
 			*ppoints = ULONG_MAX;
 		}


* [PATCH 08 of 13] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 07 of 13] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 09 of 13] oom select should only take rss into account Andrea Arcangeli
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470022 -3600
# Node ID be951f4c07326327719ad105f14be41296fcf753
# Parent  ee9691f08d054949b7718cff94c4f132d97626de
stop useless vm trashing while we wait the TIF_MEMDIE task to exit

There's no point in trying to free memory if we're oom.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,6 +1129,13 @@ static unsigned long shrink_zone(int pri
 		nr_inactive = 0;
 
 	while (nr_active || nr_inactive) {
+		if (unlikely(zone_is_oom_locked(zone))) {
+			if (!test_thread_flag(TIF_MEMDIE))
+				/* get out of the way */
+				schedule_timeout_interruptible(1);
+			else
+				break;
+		}
 		if (nr_active) {
 			nr_to_scan = min(nr_active,
 					(unsigned long)sc->swap_cluster_max);


* [PATCH 09 of 13] oom select should only take rss into account
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 08 of 13] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 10 of 13] limit reclaim if enough pages have been freed Andrea Arcangeli
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470022 -3600
# Node ID 6c433e92ef119dd39893c6b54e41154866c32ef8
# Parent  be951f4c07326327719ad105f14be41296fcf753
oom select should only take rss into account

When running workloads where many tasks grow their virtual memory
simultaneously, they all have a relatively small virtual memory when
oom triggers (compared to innocent long-standing tasks), so the oom
killer selects mysql/apache and other things with very large VM but
very small RSS. RSS is the only thing that matters: killing a task with
huge VM but zero RSS is not useful. Many apps tend to have large VM but
small RSS in the first place (regardless of swapping activity) and they
shouldn't be penalized like this.
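
A simplified user-space sketch of the scoring change follows. It only models
the base score and the per-child contribution seen in the diff below; the
struct mm_model and both function names are hypothetical, and the real
badness() applies further adjustments (and skips children sharing the parent's
mm) that are left out here.

```c
#include <stddef.h>

/* Hypothetical per-task memory figures, in pages. */
struct mm_model {
	unsigned long total_vm;	/* mapped virtual memory */
	unsigned long rss;	/* resident pages */
};

/* Old scoring: based on virtual size, each child adds half of its own. */
static unsigned long points_total_vm(const struct mm_model *mm,
				     const struct mm_model *children,
				     size_t nr_children)
{
	unsigned long points = mm->total_vm;

	for (size_t i = 0; i < nr_children; i++)
		points += children[i].total_vm / 2 + 1;
	return points;
}

/* Patched scoring: same shape, but based on resident pages only. */
static unsigned long points_rss(const struct mm_model *mm,
				const struct mm_model *children,
				size_t nr_children)
{
	unsigned long points = mm->rss;

	for (size_t i = 0; i < nr_children; i++)
		points += children[i].rss / 2 + 1;
	return points;
}
```

For a database-like task with 500000 pages of VM but 2000 resident, and a
memory hog with 60000 pages of VM and 55000 resident, the old scoring ranks
the database higher while the RSS-based scoring ranks the hog higher.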

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -68,7 +68,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_rss(mm);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -92,7 +92,7 @@ unsigned long badness(struct task_struct
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
 		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+			points += get_mm_rss(child->mm)/2 + 1;
 		task_unlock(child);
 	}
 


* [PATCH 10 of 13] limit reclaim if enough pages have been freed
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 09 of 13] oom select should only take rss into account Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 11 of 13] not-wait-memdie Andrea Arcangeli
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470022 -3600
# Node ID 0a13c24681cf4851555c87358fc2ec2465f9ef39
# Parent  6c433e92ef119dd39893c6b54e41154866c32ef8
limit reclaim if enough pages have been freed

No need to wipe out a huge chunk of the cache.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1149,6 +1149,8 @@ static unsigned long shrink_zone(int pri
 			nr_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
+			if (nr_reclaimed >= sc->swap_cluster_max)
+				break;
 		}
 	}
 


* [PATCH 11 of 13] not-wait-memdie
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 10 of 13] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 12 of 13] gfp-repeat stop with TIF_MEMDIE Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 13 of 13] congestion wait Andrea Arcangeli
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199470022 -3600
# Node ID ecc696d359edebbfe35566510f78a4be445c8f67
# Parent  0a13c24681cf4851555c87358fc2ec2465f9ef39
not-wait-memdie

Don't wait for TIF_MEMDIE tasks forever, because they may be stuck on some
kernel lock owned by some task that requires memory to exit the critical section.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -222,12 +222,16 @@ static struct task_struct *select_bad_pr
 		 * being killed. Don't allow any other task access to the
 		 * memory reserve.
 		 *
-		 * Note: this may have a chance of deadlock if it gets
-		 * blocked waiting for another task which itself is waiting
-		 * for memory. Is there a better alternative?
+		 * But if the TIF_MEMDIE task stays around for more than
+		 * MEMDIE_DELAY jiffies, ignore it and fallback killing
+		 * another task.
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
-			return ERR_PTR(-1UL);
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			if (time_before(p->memdie_jiffies + MEMDIE_DELAY, jiffies))
+				continue;
+			else
+				return ERR_PTR(-1UL);
+		}
 
 		/*
 		 * This is in the process of releasing memory so wait for it

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 12 of 13] gfp-repeat stop with TIF_MEMDIE
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 11 of 13] not-wait-memdie Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  2008-01-08  7:50 ` [PATCH 13 of 13] congestion wait Andrea Arcangeli
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User andrea@cpushare.com
# Date 1199692960 -3600
# Node ID 74af3b1477511c7bd6a526b47195ddf95a5424dc
# Parent  ecc696d359edebbfe35566510f78a4be445c8f67
gfp-repeat stop with TIF_MEMDIE

Let the GFP_REPEAT task quit if TIF_MEMDIE is set.
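
The retry decision after this change can be modelled in user space as follows.
The flag values and the _MODEL names are hypothetical placeholders (the real
__GFP_* bits differ), the costly order is assumed to be 3, and the memdie
argument stands in for test_thread_flag(TIF_MEMDIE).

```c
/* Hypothetical flag bits; the kernel's __GFP_* values are different. */
#define GFP_NORETRY_MODEL	0x1u
#define GFP_REPEAT_MODEL	0x2u
#define GFP_NOFAIL_MODEL	0x4u
#define COSTLY_ORDER_MODEL	3u	/* assumed PAGE_ALLOC_COSTLY_ORDER */

/* Returns 1 if the allocator should loop and try again, 0 otherwise. */
static int should_retry(unsigned int gfp_mask, unsigned int order, int memdie)
{
	int do_retry = 0;

	if (!(gfp_mask & GFP_NORETRY_MODEL)) {
		if (order <= COSTLY_ORDER_MODEL ||
		    (gfp_mask & GFP_REPEAT_MODEL)) {
			/* New in this patch: an oom-killed task gives up. */
			if (!memdie)
				do_retry = 1;
		}
		/* NOFAIL still retries, even for an oom-killed task. */
		if (gfp_mask & GFP_NOFAIL_MODEL)
			do_retry = 1;
	}
	return do_retry;
}
```

In this model a repeat-style allocation retries for an ordinary task, stops
retrying once the task has been oom-killed, and a no-fail allocation keeps
retrying in both cases.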

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1617,7 +1617,8 @@ nofail_alloc:
 	if (!(gfp_mask & __GFP_NORETRY)) {
 		if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
 						(gfp_mask & __GFP_REPEAT))
-			do_retry = 1;
+			if (likely(!test_thread_flag(TIF_MEMDIE)))
+				do_retry = 1;
 		if (gfp_mask & __GFP_NOFAIL)
 			do_retry = 1;
 	}


* [PATCH 13 of 13] congestion wait
  2008-01-08  7:50 [PATCH 00 of 13] oom deadlock fixes # try 2 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2008-01-08  7:50 ` [PATCH 12 of 13] gfp-repeat stop with TIF_MEMDIE Andrea Arcangeli
@ 2008-01-08  7:50 ` Andrea Arcangeli
  12 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-01-08  7:50 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User andrea@cpushare.com
# Date 1199701210 -3600
# Node ID 352591adebd643c51fe629c5ee343342f60b24f0
# Parent  74af3b1477511c7bd6a526b47195ddf95a5424dc
congestion wait

Don't block in congestion_wait if TIF_MEMDIE is set.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -83,6 +83,9 @@ long congestion_wait(int rw, long timeou
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[rw];
 
+	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+		return timeout;
+
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);

