linux-mm.kvack.org archive mirror
* [PATCH 00 of 16] OOM related fixes
@ 2007-06-08 20:02 Andrea Arcangeli
  2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
                   ` (16 more replies)
  0 siblings, 17 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:02 UTC (permalink / raw)
  To: linux-mm

Hello everyone,

this is a set of fixes done in the context of a quite evil workload: many
tasks reading large files from nfs in parallel, with big read buffers, all at
the same time, until the system goes oom. Almost all of these fixes turned
out to be required to fix the customer workload on top of an older sles
kernel. The forward port of the fixes has already been tested successfully on
similar evil workloads.
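
For reference, a rough userspace sketch of this kind of stressor (purely
illustrative, not the actual reproducer used here; the path, task count and
buffer size are made up):

/*
 * Illustrative stressor sketch: many tasks reading a large nfs file in
 * parallel with big read buffers until the box goes oom.
 * Build with: gcc stressor.c
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NTASKS   32
#define BUFSIZE  (64UL << 20)	/* 64M read buffer per task (made up) */

int main(void)
{
	int i;

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			char *buf = malloc(BUFSIZE);
			int fd = open("/mnt/nfs/bigfile", O_RDONLY);

			if (!buf || fd < 0)
				_exit(1);
			for (;;) {
				/* huge reads keep each task inside the kernel a long time */
				if (read(fd, buf, BUFSIZE) <= 0)
					lseek(fd, 0, SEEK_SET);
			}
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}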

mainline vanilla running a somewhat simulated workload:

Jun  8 06:06:56 kvm kernel: Out of memory: Killed process 3282 (klauncher).
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 3002 (qmgr) score 11225 or a child
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 3001 (pickup) score 11216 or a child
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 2186 (hald) score 11004 or a child
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 3515 (bash) score 9447 or a child
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 2186 (hald) score 8558 or a child
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 2142 (dbus-daemon) score 5591 or a child
Jun  8 06:17:35 kvm kernel: Out of memory: kill process 3549 (recursive_readd) score 4597 or a child
Jun  8 06:17:43 kvm kernel: Out of memory: kill process 3591 (pickup) score 9756 or a child
Jun  8 06:17:43 kvm kernel: Out of memory: kill process 2204 (hald-addon-acpi) score 4121 or a child
Jun  8 06:17:43 kvm kernel: Out of memory: kill process 3515 (bash) score 3808 or a child
Jun  8 06:17:45 kvm kernel: Out of memory: kill process 3555 (recursive_readd) score 2330 or a child
Jun  8 06:17:53 kvm kernel: Out of memory: kill process 3554 (recursive_readd) score 2605 or a child
Jun  8 06:18:00 kvm kernel: Out of memory: kill process 3170 (nscd) score 1985 or a child
Jun  8 06:18:00 kvm kernel: Out of memory: kill process 3187 (nscd) score 1985 or a child
Jun  8 06:18:00 kvm kernel: Out of memory: kill process 3188 (nscd) score 1985 or a child
Jun  8 06:18:00 kvm kernel: Out of memory: kill process 2855 (portmap) score 1965 or a child
Jun  8 06:18:00 kvm kernel: Out of memory: kill process 3551 (recursive_readd) score 859 or a child
[ eventually it deadlocks and stops killing new tasks ]

mainline + fixes running the same simulated workload:

Jun  8 13:35:32 kvm kernel: Out of memory: kill process 3494 (recursive_readd) score 3822 or a child
Jun  8 13:35:33 kvm kernel: Out of memory: kill process 3494 (recursive_readd) score 3822 or a child
Jun  8 13:35:33 kvm kernel: Out of memory: kill process 3494 (recursive_readd) score 3822 or a child
Jun  8 13:37:33 kvm kernel: Out of memory: kill process 3505 (recursive_readd) score 622 or a child
Jun  8 13:37:34 kvm kernel: Out of memory: kill process 3510 (recursive_readd) score 418 or a child
Jun  8 13:37:36 kvm kernel: Out of memory: kill process 3535 (recursive_readd) score 377 or a child
Jun  8 13:37:36 kvm kernel: Out of memory: kill process 3498 (recursive_readd) score 370 or a child
Jun  8 13:37:36 kvm kernel: Out of memory: kill process 3516 (recursive_readd) score 364 or a child
Jun  8 13:37:36 kvm kernel: Out of memory: kill process 3515 (recursive_readd) score 357 or a child
Jun  8 13:40:49 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:50 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:40:51 kvm kernel: Out of memory: kill process 3537 (recursive_readd) score 2391 or a child
Jun  8 13:41:55 kvm kernel: Out of memory: kill process 3558 (recursive_readd) score 356 or a child
Jun  8 13:41:56 kvm kernel: Out of memory: kill process 3578 (recursive_readd) score 355 or a child
Jun  8 13:41:56 kvm kernel: Out of memory: kill process 3577 (recursive_readd) score 350 or a child
Jun  8 13:41:56 kvm kernel: Out of memory: kill process 3572 (recursive_readd) score 347 or a child
Jun  8 13:41:56 kvm kernel: Out of memory: kill process 3568 (recursive_readd) score 346 or a child

The oom deadlock detection triggers a couple of times against the PG_locked
deadlock:

Jun  8 13:51:19 kvm kernel: Killed process 3504 (recursive_readd)
Jun  8 13:51:19 kvm kernel: detected probable OOM deadlock, so killing another task
Jun  8 13:51:19 kvm kernel: Out of memory: kill process 3532 (recursive_readd) score 1225 or a child

Example stack trace of a TIF_MEMDIE killed task (not literally verified that
this particular one had TIF_MEMDIE set, but it is the same trace as the one
verified earlier):

recursive_rea D ffff810001056418     0  3548   3544 (NOTLB)
 ffff81000e57dba8 0000000000000082 ffff8100010af5e8 ffff8100148df730
 ffff81001ff3ea10 0000000000bd2e1b ffff8100148df908 0000000000000046
 ffff81001fd5f170 ffffffff8031c36d ffff81001fd5f170 ffff810001056418
Call Trace:
 [<ffffffff8031c36d>] __generic_unplug_device+0x13/0x24
 [<ffffffff80244163>] sync_page+0x0/0x40
 [<ffffffff804cdf5b>] io_schedule+0xf/0x17
 [<ffffffff8024419e>] sync_page+0x3b/0x40
 [<ffffffff804ce162>] __wait_on_bit_lock+0x36/0x65
 [<ffffffff80244150>] __lock_page+0x5e/0x64
 [<ffffffff802321f1>] wake_bit_function+0x0/0x23
 [<ffffffff802440c0>] find_get_page+0xe/0x40
 [<ffffffff80244a33>] do_generic_mapping_read+0x200/0x450
 [<ffffffff80243f26>] file_read_actor+0x0/0x11d
 [<ffffffff80247fd4>] get_page_from_freelist+0x2d3/0x36e
 [<ffffffff802464d0>] generic_file_aio_read+0x11d/0x159
 [<ffffffff80260bdc>] do_sync_read+0xc9/0x10c
 [<ffffffff80252adb>] vma_merge+0x10c/0x195
 [<ffffffff802321c3>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80253a06>] do_mmap_pgoff+0x5e1/0x74c
 [<ffffffff8026134d>] vfs_read+0xaa/0x132
 [<ffffffff80261662>] sys_read+0x45/0x6e
 [<ffffffff8020991e>] system_call+0x7e/0x83


* [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
@ 2007-06-08 20:02 ` Andrea Arcangeli
  2007-06-10 17:36   ` Rik van Riel
  2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:02 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332959 -7200
# Node ID 8e38f7656968417dfee09fbb6450a8f1e70f8b21
# Parent  8b84ac74c8464bb6e4a2c08ff2a656d06c8667ca
remove nr_scan_inactive/active

The older atomic_add/atomic_set were pointless (atomic_set vs atomic_add would
race), but removing them didn't actually remove the race: it is still there,
for the same reasons atomic_add/set couldn't prevent it. This is really the
kind of code I dislike because it's sort of buggy, it shouldn't make any
measurable difference, and when it does do something for real it can only
hurt!

The real focus is on shrink_zone (ignore the other places where it's used,
which are even less interesting). Assume two tasks add to nr_scan_*active at
the same time (first line of the old buggy code): they effectively double
their scan rate for no good reason. Instead of scanning nr_entries each, they
will scan nr_entries*2 each. The more CPUs, the bigger the race, the higher
the multiplication effect, and the harder it becomes to detect oom. In the
case that nr_*active < sc->swap_cluster_max, regardless of any future
invocation of alloc_pages, we'll already be going down in the priorities in
the current alloc_pages invocation if DEF_PRIORITY was too high to do any
work, so accumulating nr_scan_*active doesn't seem interesting even when it's
smaller than sc->swap_cluster_max. Each task should work for itself without
caring much about what the others are doing.
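
To see the doubling effect outside the kernel, here is a hedged userspace
sketch of the same lost-update pattern (not kernel code; the shared counter
and the 1000-entry delta are made-up stand-ins for zone->nr_scan_active and
the per-task scan goal). When the two threads interleave, each one reads back
2000 and so scans twice what it contributed:

/* build with: gcc -pthread nr_scan_race.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static unsigned long nr_scan;		/* plays the role of zone->nr_scan_active */
static pthread_barrier_t start;

static void *reclaimer(void *arg)
{
	unsigned long delta = 1000;	/* what this "task" intends to scan */
	unsigned long to_scan;

	pthread_barrier_wait(&start);	/* maximize the chance the updates overlap */
	nr_scan += delta;		/* racy read-modify-write, as in the old code */
	to_scan = nr_scan;		/* may also observe the other task's delta */
	printf("task %ld scans %lu entries (wanted %lu)\n",
	       (long)arg, to_scan, delta);
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	long i;

	pthread_barrier_init(&start, NULL, 2);
	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, reclaimer, (void *)i);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}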

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,8 +220,6 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	active_list;
 	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2649,8 +2649,6 @@ static void __meminit free_area_init_cor
 		zone_pcp_init(zone);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
 		zap_zone_vm_stats(zone);
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -915,20 +915,11 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
-		zone->nr_scan_active = 0;
-	else
+	nr_active = zone_page_state(zone, NR_ACTIVE) >> priority;
+	if (nr_active < sc->swap_cluster_max)
 		nr_active = 0;
-
-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
+	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
+	if (nr_inactive < sc->swap_cluster_max)
 		nr_inactive = 0;
 
 	while (nr_active || nr_inactive) {
@@ -1392,22 +1383,14 @@ static unsigned long shrink_all_zones(un
 
 		/* For pass = 0 we don't shrink the active list */
 		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
-				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
+			nr_to_scan = (zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
+			if (nr_to_scan >= nr_pages || pass > 3) {
 				shrink_active_list(nr_to_scan, zone, sc, prio);
 			}
 		}
 
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
+		nr_to_scan = (zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
+		if (nr_to_scan >= nr_pages || pass > 3) {
 			ret += shrink_inactive_list(nr_to_scan, zone, sc);
 			if (ret >= nr_pages)
 				return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -554,7 +554,7 @@ static int zoneinfo_show(struct seq_file
 			   "\n        min      %lu"
 			   "\n        low      %lu"
 			   "\n        high     %lu"
-			   "\n        scanned  %lu (a: %lu i: %lu)"
+			   "\n        scanned  %lu"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone_page_state(zone, NR_FREE_PAGES),
@@ -562,7 +562,6 @@ static int zoneinfo_show(struct seq_file
 			   zone->pages_low,
 			   zone->pages_high,
 			   zone->pages_scanned,
-			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,
 			   zone->present_pages);
 


* [PATCH 02 of 16] avoid oom deadlock in nfs_create_request
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
  2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-10 17:38   ` Rik van Riel
  2007-06-08 20:03 ` [PATCH 03 of 16] prevent oom deadlocks during read/write operations Andrea Arcangeli
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332960 -7200
# Node ID d64cb81222748354bf5b16258197217465f35aeb
# Parent  8e38f7656968417dfee09fbb6450a8f1e70f8b21
avoid oom deadlock in nfs_create_request

When SIGKILL is pending after the oom killer has set TIF_MEMDIE, the task
must go away or the VM will malfunction; don't loop forever in
nfs_create_request retrying the allocation, return -ENOMEM instead.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte
 	struct nfs_server *server = NFS_SERVER(inode);
 	struct nfs_page		*req;
 
-	for (;;) {
-		/* try to allocate the request struct */
-		req = nfs_page_alloc();
-		if (req != NULL)
-			break;
-
-		if (signalled() && (server->flags & NFS_MOUNT_INTR))
-			return ERR_PTR(-ERESTARTSYS);
-		yield();
-	}
+	/* try to allocate the request struct */
+	req = nfs_page_alloc();
+	if (unlikely(!req)) {
+		/*
+		 * -ENOMEM will be returned only when TIF_MEMDIE is set
+		 * so userland shouldn't risk to get confused by a new
+		 * unhandled ENOMEM errno.
+		 */
+		WARN_ON(!test_thread_flag(TIF_MEMDIE));
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (signalled() && (server->flags & NFS_MOUNT_INTR))
+		return ERR_PTR(-ERESTARTSYS);
 
 	/* Initialize the request struct. Initially, we assume a
 	 * long write-back delay. This will be adjusted in


* [PATCH 03 of 16] prevent oom deadlocks during read/write operations
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
  2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332960 -7200
# Node ID 532a5f712848ee75d827bfe233b9364a709e1fc1
# Parent  d64cb81222748354bf5b16258197217465f35aeb
prevent oom deadlocks during read/write operations

We need to react to SIGKILL during read/write with huge buffers, or it
becomes too easy to prevent a SIGKILLed task from running do_exit promptly
after it has been selected for oom-killage.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -894,6 +894,13 @@ void do_generic_mapping_read(struct addr
 		struct page *page;
 		unsigned long nr, ret;
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
+			/*
+			 * Must not hang almost forever in D state in presence of sigkill
+			 * and lots of ram/swap (think during OOM).
+			 */
+			break;
+
 		/* nr is the maximum number of bytes to copy from this page */
 		nr = PAGE_CACHE_SIZE;
 		if (index >= end_index) {
@@ -2105,6 +2112,13 @@ generic_file_buffered_write(struct kiocb
 		unsigned long index;
 		unsigned long offset;
 		size_t copied;
+
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
+			/*
+			 * Must not hang almost forever in D state in presence of sigkill
+			 * and lots of ram/swap (think during OOM).
+			 */
+			break;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;


* [PATCH 04 of 16] serialize oom killer
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 03 of 16] prevent oom deadlocks during read/write operations Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-09  6:43   ` Peter Zijlstra
  2007-06-08 20:03 ` [PATCH 05 of 16] avoid selecting already killed tasks Andrea Arcangeli
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332960 -7200
# Node ID baa866fedc79cb333b90004da2730715c145f1d5
# Parent  532a5f712848ee75d827bfe233b9364a709e1fc1
serialize oom killer

It's risky and useless to run two oom killers in parallel, so let's serialize
them to reduce the probability of spurious oom-killage.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -400,12 +400,15 @@ void out_of_memory(struct zonelist *zone
 	unsigned long points = 0;
 	unsigned long freed = 0;
 	int constraint;
+	static DECLARE_MUTEX(OOM_lock);
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
 		/* Got some memory back in the last second. */
 		return;
 
+	if (down_trylock(&OOM_lock))
+		return;
 	if (printk_ratelimit()) {
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
@@ -472,4 +475,6 @@ out:
 	 */
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
-}
+
+	up(&OOM_lock);
+}


* [PATCH 05 of 16] avoid selecting already killed tasks
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 06 of 16] reduce the probability of an OOM livelock Andrea Arcangeli
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332960 -7200
# Node ID 2ebc46595ead0f1790c6ec1d0302dd60ffbb1978
# Parent  baa866fedc79cb333b90004da2730715c145f1d5
avoid selecting already killed tasks

If the killed task doesn't go away because it's waiting on some other task
that needs to allocate memory before it can release the i_sem or some other
lock, we must fall back to killing another task in order to unblock the
originally selected and already oom-killed one. But the logic that kills the
children first would deadlock if the already oom-killed task happened to be
the first child of the newly selected task.
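
Concretely, a made-up example of the scenario:

    newly selected task P
     +- child C: already TIF_MEMDIE, stuck on a lock  (old code: picked again, frees nothing)
     +- child D: innocent                             (with this patch: C is skipped, D is killed)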

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -366,6 +366,12 @@ static int oom_kill_process(struct task_
 		c = list_entry(tsk, struct task_struct, sibling);
 		if (c->mm == p->mm)
 			continue;
+		/*
+		 * We cannot select tasks with TIF_MEMDIE already set
+		 * or we'll hard deadlock.
+		 */
+		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
+			continue;
 		if (!oom_kill_task(c))
 			return 0;
 	}


* [PATCH 06 of 16] reduce the probability of an OOM livelock
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 05 of 16] avoid selecting already killed tasks Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332961 -7200
# Node ID fe82f6d082c859c641664990c6e14de8d16dcb5d
# Parent  2ebc46595ead0f1790c6ec1d0302dd60ffbb1978
reduce the probability of an OOM livelock

There's no need to loop way too many times over the lrus in order to declare
defeat and decide to kill a task. The more loops we do, the more likely we'll
run into a livelock with a page bouncing back and forth between tasks. The
maximum number of entries to check, in a loop that returns less than
swap-cluster-max pages freed, should be the size of the list (or at most
twice the size of the list if you want to be really paranoid about the
PG_referenced bit).
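
For the record, a single pass of the existing priority loop already checks
roughly that much (assuming DEF_PRIORITY is 12, as in mainline): over a list
of N pages it asks to scan about

	N/4096 + N/2048 + ... + N/2 + N = N * (2 - 1/4096) ~= 2*N

entries, i.e. already the paranoid "twice the list size" bound, so repeating
the whole sweep over and over buys nothing before declaring defeat.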

Our objective there is to know reliably when it's time to kill a task; trying
to free a few more pages at that already critical point is worthless.

This seems to have the effect of reducing the "hang" time during oom
killing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1007,7 +1007,7 @@ unsigned long try_to_free_pages(struct z
 	int priority;
 	int ret = 0;
 	unsigned long total_scanned = 0;
-	unsigned long nr_reclaimed = 0;
+	unsigned long nr_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
 	int i;
@@ -1035,12 +1035,12 @@ unsigned long try_to_free_pages(struct z
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		nr_reclaimed = shrink_zones(priority, zones, &sc);
+		if (reclaim_state)
+			reclaim_state->reclaimed_slab = 0;
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
-		if (reclaim_state) {
+		if (reclaim_state)
 			nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
 		total_scanned += sc.nr_scanned;
 		if (nr_reclaimed >= sc.swap_cluster_max) {
 			ret = 1;
@@ -1131,7 +1131,6 @@ static unsigned long balance_pgdat(pg_da
 
 loop_again:
 	total_scanned = 0;
-	nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
@@ -1186,6 +1185,7 @@ loop_again:
 		 * pages behind kswapd's direction of progress, which would
 		 * cause too much scanning of the lower zones.
 		 */
+		nr_reclaimed = 0;
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;


* [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 06 of 16] reduce the probability of an OOM livelock Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332961 -7200
# Node ID aafcc5c9057f11d88c43b823c241f14a5ebdd638
# Parent  fe82f6d082c859c641664990c6e14de8d16dcb5d
balance_pgdat doesn't return the number of pages freed

After the previous patch nr_reclaimed is reset every pass, so at the end it
only holds the number of pages freed in the last pass and the return value is
no longer meaningful; remove it.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1092,8 +1092,6 @@ out:
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *
- * Returns the number of pages which were actually freed.
- *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
  * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
@@ -1109,7 +1107,7 @@ out:
  * the page allocator fallback scheme to ensure that aging of pages is balanced
  * across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static void balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -1259,8 +1257,6 @@ out:
 
 		goto loop_again;
 	}
-
-	return nr_reclaimed;
 }
 
 /*


* [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332961 -7200
# Node ID 60059913ab07906fceda14ffa72f2c77ef282fca
# Parent  aafcc5c9057f11d88c43b823c241f14a5ebdd638
don't depend on PF_EXITING tasks to go away

A PF_EXITING task doesn't have TIF_MEMDIE set, so it might get stuck in
memory allocations without access to the PF_MEMALLOC pool (that said, ideally
do_exit shouldn't require memory allocations at all, especially not before
calling exit_mm). The same way we raise its privilege to TIF_MEMDIE if it's
the current task, we should do it even when it's not the current task, to
speed up oom killing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -233,27 +233,13 @@ static struct task_struct *select_bad_pr
 		 * Note: this may have a chance of deadlock if it gets
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
+		 *
+		 * Better not to skip PF_EXITING tasks, since they
+		 * don't have access to the PF_MEMALLOC pool until
+		 * we select them here first.
 		 */
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
-
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
-		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = ULONG_MAX;
-		}
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;


* [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't go away
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 21:57   ` Christoph Lameter
  2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332961 -7200
# Node ID 4a70e6a4142230fa161dd37202cd62fede122880
# Parent  60059913ab07906fceda14ffa72f2c77ef282fca
fallback killing more tasks if tif-memdie doesn't go away

Waiting indefinitely for a TIF_MEMDIE task to go away will deadlock. Two
tasks reading from the same inode at the same time, both going out of memory
inside a read(largebuffer) syscall, can even deadlock through contention over
the PG_locked bitflag. The task holding the page lock detects oom, but the
oom killer decides to kill the task blocked in wait_on_page_locked(). The
task holding the page lock then hangs inside alloc_pages, which will never
return because it waits for the TIF_MEMDIE task to go away, but the
TIF_MEMDIE task can't go away until the task holding the page lock is killed
in the first place.
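
Spelled out, an illustrative timeline of that scenario (task names made up):

    task A                                    task B
    read(largebuffer)                         read(largebuffer) on the same inode
    lock_page(P) succeeds                     lock_page(P) blocks in wait_on_page_locked()
    alloc_pages() -> out_of_memory()
    oom killer selects B, sets TIF_MEMDIE
    A loops in alloc_pages waiting for
    the TIF_MEMDIE task (B) to go away        B cannot exit: it is still stuck on P,
                                              and only A can unlock it

Neither side can make progress until something else kills A, which is what
the 10 second fallback below provides.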

It's quite impractical to teach the oom killer the locking dependencies
across running tasks, so the feasible fix is a logic that, after waiting a
long time for a TIF_MEMDIE task to go away, falls back to killing one more
task. This also largely eliminates the possibility of spurious oom killage
(i.e. two tasks killed despite only one having to be killed). It's not a
mathematical guarantee, because we can't demonstrate that a TIF_MEMDIE
SIGKILLed task that didn't manage to complete do_exit within 10sec never
will. But the current probability of spurious oom killing is surely much
higher than the probability of spurious oom killing with this patch applied.

The whole locking is built around the tasklist_lock. On one side do_exit
reads TIF_MEMDIE and clears VM_is_OOM under the lock; on the other side the
oom killer accesses VM_is_OOM and TIF_MEMDIE under the lock. The oom killer
only takes a read_lock, but it effectively acts as a write lock thanks to the
OOM_lock semaphore allowing only one oom killer at a time (the locking rule
is: either take write_lock_irq, or read_lock plus OOM_lock).

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -845,6 +845,15 @@ static void exit_notify(struct task_stru
 	     unlikely(tsk->parent->signal->flags & SIGNAL_GROUP_EXIT)))
 		state = EXIT_DEAD;
 	tsk->exit_state = state;
+
+	/*
+	 * Read TIF_MEMDIE and set VM_is_OOM to 0 atomically inside
+	 * the tasklist_lock_lock.
+	 */
+	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
+		extern unsigned long VM_is_OOM;
+		clear_bit(0, &VM_is_OOM);
+	}
 
 	write_unlock_irq(&tasklist_lock);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -28,6 +28,9 @@ int sysctl_panic_on_oom;
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
+unsigned long VM_is_OOM;
+static unsigned long last_tif_memdie_jiffies;
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -225,21 +228,14 @@ static struct task_struct *select_bad_pr
 		if (is_init(p))
 			continue;
 
-		/*
-		 * This task already has access to memory reserves and is
-		 * being killed. Don't allow any other task access to the
-		 * memory reserve.
-		 *
-		 * Note: this may have a chance of deadlock if it gets
-		 * blocked waiting for another task which itself is waiting
-		 * for memory. Is there a better alternative?
-		 *
-		 * Better not to skip PF_EXITING tasks, since they
-		 * don't have access to the PF_MEMALLOC pool until
-		 * we select them here first.
-		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
-			return ERR_PTR(-1UL);
+		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
+			/*
+			 * Either we already waited long enough,
+			 * or exit_mm already run, so we must
+			 * try to kill another task.
+			 */
+			continue;
+		}
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;
@@ -276,13 +272,16 @@ static void __oom_kill_task(struct task_
 	if (verbose)
 		printk(KERN_ERR "Killed process %d (%s)\n", p->pid, p->comm);
 
+	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
+		last_tif_memdie_jiffies = jiffies;
+		set_bit(0, &VM_is_OOM);
+	}
 	/*
 	 * We give our sacrificial lamb high priority and access to
 	 * all the memory it needs. That way it should be able to
 	 * exit() and clear out its resources quickly...
 	 */
 	p->time_slice = HZ;
-	set_tsk_thread_flag(p, TIF_MEMDIE);
 
 	force_sig(SIGKILL, p);
 }
@@ -419,6 +418,18 @@ void out_of_memory(struct zonelist *zone
 	constraint = constrained_alloc(zonelist, gfp_mask);
 	cpuset_lock();
 	read_lock(&tasklist_lock);
+
+	/*
+	 * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's
+	 * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM
+	 * is concerned.
+	 */
+	if (unlikely(test_bit(0, &VM_is_OOM))) {
+		if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
+			goto out;
+		printk("detected probable OOM deadlock, so killing another task\n");
+		last_tif_memdie_jiffies = jiffies;
+	}
 
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
@@ -440,10 +451,6 @@ retry:
 		 * issues we may have.
 		 */
 		p = select_bad_process(&points);
-
-		if (PTR_ERR(p) == -1UL)
-			goto out;
-
 		/* Found nothing?!?! Either we hang forever, or we panic. */
 		if (!p) {
 			read_unlock(&tasklist_lock);


* [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 21:48   ` Christoph Lameter
  2007-06-08 20:03 ` [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID 24250f0be1aa26e5c6e33fd97b9eae125db9fbde
# Parent  4a70e6a4142230fa161dd37202cd62fede122880
stop useless vm trashing while we wait the TIF_MEMDIE task to exit

There's no point in trying to free memory if we're oom.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,6 +159,8 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
 /* linux/mm/oom_kill.c */
+extern unsigned long VM_is_OOM;
+#define is_VM_OOM() unlikely(test_bit(0, &VM_is_OOM))
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -923,6 +923,8 @@ static unsigned long shrink_zone(int pri
 		nr_inactive = 0;
 
 	while (nr_active || nr_inactive) {
+		if (is_VM_OOM())
+			break;
 		if (nr_active) {
 			nr_to_scan = min(nr_active,
 					(unsigned long)sc->swap_cluster_max);
@@ -1032,6 +1034,17 @@ unsigned long try_to_free_pages(struct z
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		if (is_VM_OOM()) {
+			if (!test_thread_flag(TIF_MEMDIE)) {
+				/* get out of the way */
+				schedule_timeout_interruptible(1);
+				/* don't waste cpu if we're still oom */
+				if (is_VM_OOM())
+					goto out;
+			} else
+				goto out;
+		}
+
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();


* [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 12 of 16] show mem information only when a task is actually being killed Andrea Arcangeli
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID c6dfb528f53eaac2188b49f67eed51c1a33ce7cd
# Parent  24250f0be1aa26e5c6e33fd97b9eae125db9fbde
the oom schedule timeout isn't needed with the VM_is_OOM logic

The whole point of VM_is_OOM is to give the TIF_MEMDIE task a proper amount
of time to exit, so the extra schedule_timeout here is redundant.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -468,12 +468,5 @@ out:
 	read_unlock(&tasklist_lock);
 	cpuset_unlock();
 
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory unless "p" is current
-	 */
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-
 	up(&OOM_lock);
 }


* [PATCH 12 of 16] show mem information only when a task is actually being killed
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 13 of 16] simplify oom heuristics Andrea Arcangeli
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID db4c0ce6754d7838713eda1851aef43c2fb52fca
# Parent  c6dfb528f53eaac2188b49f67eed51c1a33ce7cd
show mem information only when a task is actually being killed

Don't print the memory information while VM_is_OOM is set and the timeout
hasn't triggered yet, i.e. only show it when a task is actually about to be
killed.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -286,7 +286,7 @@ static void __oom_kill_task(struct task_
 	force_sig(SIGKILL, p);
 }
 
-static int oom_kill_task(struct task_struct *p)
+static int oom_kill_task(struct task_struct *p, gfp_t gfp_mask, int order)
 {
 	struct mm_struct *mm;
 	struct task_struct *g, *q;
@@ -313,93 +313,6 @@ static int oom_kill_task(struct task_str
 			return 1;
 	} while_each_thread(g, q);
 
-	__oom_kill_task(p, 1);
-
-	/*
-	 * kill all processes that share the ->mm (i.e. all threads),
-	 * but are in a different thread group. Don't let them have access
-	 * to memory reserves though, otherwise we might deplete all memory.
-	 */
-	do_each_thread(g, q) {
-		if (q->mm == mm && q->tgid != p->tgid)
-			force_sig(SIGKILL, q);
-	} while_each_thread(g, q);
-
-	return 0;
-}
-
-static int oom_kill_process(struct task_struct *p, unsigned long points,
-		const char *message)
-{
-	struct task_struct *c;
-	struct list_head *tsk;
-
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
-		return 0;
-	}
-
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, p->pid, p->comm, points);
-
-	/* Try to kill a child first */
-	list_for_each(tsk, &p->children) {
-		c = list_entry(tsk, struct task_struct, sibling);
-		if (c->mm == p->mm)
-			continue;
-		/*
-		 * We cannot select tasks with TIF_MEMDIE already set
-		 * or we'll hard deadlock.
-		 */
-		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
-			continue;
-		if (!oom_kill_task(c))
-			return 0;
-	}
-	return oom_kill_task(p);
-}
-
-static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
-
-int register_oom_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&oom_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(register_oom_notifier);
-
-int unregister_oom_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(unregister_oom_notifier);
-
-/**
- * out_of_memory - kill the "best" process when we run out of memory
- *
- * If we run out of memory, we have the choice between either
- * killing a random task (bad), letting the system crash (worse)
- * OR try to be smart about which process to kill. Note that we
- * don't have to be perfect here, we just have to be good.
- */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
-{
-	struct task_struct *p;
-	unsigned long points = 0;
-	unsigned long freed = 0;
-	int constraint;
-	static DECLARE_MUTEX(OOM_lock);
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	if (down_trylock(&OOM_lock))
-		return;
 	if (printk_ratelimit()) {
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
@@ -408,6 +321,94 @@ void out_of_memory(struct zonelist *zone
 		show_mem();
 	}
 
+	__oom_kill_task(p, 1);
+
+	/*
+	 * kill all processes that share the ->mm (i.e. all threads),
+	 * but are in a different thread group. Don't let them have access
+	 * to memory reserves though, otherwise we might deplete all memory.
+	 */
+	do_each_thread(g, q) {
+		if (q->mm == mm && q->tgid != p->tgid)
+			force_sig(SIGKILL, q);
+	} while_each_thread(g, q);
+
+	return 0;
+}
+
+static int oom_kill_process(struct task_struct *p, unsigned long points,
+			    const char *message, gfp_t gfp_mask, int order)
+{
+	struct task_struct *c;
+	struct list_head *tsk;
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 */
+	if (p->flags & PF_EXITING) {
+		__oom_kill_task(p, 0);
+		return 0;
+	}
+
+	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
+					message, p->pid, p->comm, points);
+
+	/* Try to kill a child first */
+	list_for_each(tsk, &p->children) {
+		c = list_entry(tsk, struct task_struct, sibling);
+		if (c->mm == p->mm)
+			continue;
+		/*
+		 * We cannot select tasks with TIF_MEMDIE already set
+		 * or we'll hard deadlock.
+		 */
+		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
+			continue;
+		if (!oom_kill_task(c, gfp_mask, order))
+			return 0;
+	}
+	return oom_kill_task(p, gfp_mask, order);
+}
+
+static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
+
+int register_oom_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&oom_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_oom_notifier);
+
+int unregister_oom_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_oom_notifier);
+
+/**
+ * out_of_memory - kill the "best" process when we run out of memory
+ *
+ * If we run out of memory, we have the choice between either
+ * killing a random task (bad), letting the system crash (worse)
+ * OR try to be smart about which process to kill. Note that we
+ * don't have to be perfect here, we just have to be good.
+ */
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+{
+	struct task_struct *p;
+	unsigned long points = 0;
+	unsigned long freed = 0;
+	int constraint;
+	static DECLARE_MUTEX(OOM_lock);
+
+	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+	if (freed > 0)
+		/* Got some memory back in the last second. */
+		return;
+
+	if (down_trylock(&OOM_lock))
+		return;
+
 	if (sysctl_panic_on_oom == 2)
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 
@@ -434,12 +435,12 @@ void out_of_memory(struct zonelist *zone
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
 		oom_kill_process(current, points,
-				"No available memory (MPOL_BIND)");
+				 "No available memory (MPOL_BIND)", gfp_mask, order);
 		break;
 
 	case CONSTRAINT_CPUSET:
 		oom_kill_process(current, points,
-				"No available memory in cpuset");
+				 "No available memory in cpuset", gfp_mask, order);
 		break;
 
 	case CONSTRAINT_NONE:
@@ -458,7 +459,7 @@ retry:
 			panic("Out of memory and no killable processes...\n");
 		}
 
-		if (oom_kill_process(p, points, "Out of memory"))
+		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
 			goto retry;
 
 		break;


* [PATCH 13 of 16] simplify oom heuristics
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 12 of 16] show mem information only when a task is actually being killed Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID dfac333eb29032dab87dd2c46f71a22037a6dc4a
# Parent  db4c0ce6754d7838713eda1851aef43c2fb52fca
simplify oom heuristics

Over time somebody had the good idea to remove the rcvd_sigterm points; this
removes more of them (the cpu_time and run_time factors). The selected task
should be the one that, if we don't kill it, will turn the system oom again
sooner rather than later. CPU time and run time tell us nothing about which
task is best to kill, so they should be removed.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,7 +52,7 @@ static unsigned long last_tif_memdie_jif
 
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
-	unsigned long points, cpu_time, run_time, s;
+	unsigned long points;
 	struct mm_struct *mm;
 	struct task_struct *child;
 
@@ -93,26 +93,6 @@ unsigned long badness(struct task_struct
 			points += child->mm->total_vm/2 + 1;
 		task_unlock(child);
 	}
-
-	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
-		>> (SHIFT_HZ + 3);
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	s = int_sqrt(cpu_time);
-	if (s)
-		points /= s;
-	s = int_sqrt(int_sqrt(run_time));
-	if (s)
-		points /= s;
 
 	/*
 	 * Niced processes are most likely less important, so double


* [PATCH 14 of 16] oom select should only take rss into account
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 13 of 16] simplify oom heuristics Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-10 17:17   ` Rik van Riel
  2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID dbd70ffd95f34cd12f1fd2f05a9cc0f9a50edb4a
# Parent  dfac333eb29032dab87dd2c46f71a22037a6dc4a
oom select should only take rss into account

When running workloads where many tasks grow their virtual memory
simultaneously, they all have a relatively small virtual memory when oom
triggers (compared to innocent long-standing tasks), so the oom killer ends
up selecting mysql/apache and other things with very large VM but very small
RSS. RSS is the only thing that matters: killing a task with huge VM but zero
RSS is not useful. Many apps tend to have large VM but small RSS in the first
place (regardless of swapping activity) and they shouldn't be penalized like
this.
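
For instance (made-up numbers): a task that has mmap()ed a 4GB file but only
touched 50MB of it has a total_vm of about 1,048,576 pages but an RSS of only
about 12,800 pages. With the old heuristic its score dwarfs that of a daemon
with 200MB resident (about 51,200 pages), even though killing the mmap-heavy
task frees almost nothing.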

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -66,7 +66,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_rss(mm);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -90,7 +90,7 @@ unsigned long badness(struct task_struct
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
 		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+			points += get_mm_rss(child->mm)/2 + 1;
 		task_unlock(child);
 	}
 


* [PATCH 15 of 16] limit reclaim if enough pages have been freed
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-10 17:20   ` Rik van Riel
  2007-06-08 20:03 ` [PATCH 16 of 16] avoid some lock operation in vm fast path Andrea Arcangeli
  2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III
  16 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID 31ef5d0bf924fb47da144321f692f4fefebf5cf5
# Parent  dbd70ffd95f34cd12f1fd2f05a9cc0f9a50edb4a
limit reclaim if enough pages have been freed

No need to wipe out a huge chunk of the cache: stop shrinking the zone once
swap_cluster_max pages have been reclaimed.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -938,6 +938,8 @@ static unsigned long shrink_zone(int pri
 			nr_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
+			if (nr_reclaimed >= sc->swap_cluster_max)
+				break;
 		}
 	}
 


* [PATCH 16 of 16] avoid some lock operation in vm fast path
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2007-06-08 20:03 ` Andrea Arcangeli
  2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III
  16 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-08 20:03 UTC (permalink / raw)
  To: linux-mm

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1181332962 -7200
# Node ID 19fb832beb3c83b7bed13c1a2f54ec4e077cfc0d
# Parent  31ef5d0bf924fb47da144321f692f4fefebf5cf5
avoid some lock operation in vm fast path

Let's not make non-NUMA kernels pay for the reclaim_in_progress counter that
only NUMA needs. The #ifdefs are not nice, but at least this way perhaps
somebody will clean it up instead of hiding the inefficiency in there.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -223,8 +223,10 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
+#ifdef CONFIG_NUMA
 	/* A count of how many reclaimers are scanning this zone */
 	atomic_t		reclaim_in_progress;
+#endif
 
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2650,7 +2650,9 @@ static void __meminit free_area_init_cor
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zap_zone_vm_stats(zone);
+#ifdef CONFIG_NUMA
 		atomic_set(&zone->reclaim_in_progress, 0);
+#endif
 		if (!size)
 			continue;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -909,7 +909,9 @@ static unsigned long shrink_zone(int pri
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
 
+#ifdef CONFIG_NUMA
 	atomic_inc(&zone->reclaim_in_progress);
+#endif
 
 	/*
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
@@ -945,7 +947,9 @@ static unsigned long shrink_zone(int pri
 
 	throttle_vm_writeout(sc->gfp_mask);
 
+#ifdef CONFIG_NUMA
 	atomic_dec(&zone->reclaim_in_progress);
+#endif
 	return nr_reclaimed;
 }
 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 00 of 16] OOM related fixes
  2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2007-06-08 20:03 ` [PATCH 16 of 16] avoid some lock operation in vm fast path Andrea Arcangeli
@ 2007-06-08 21:26 ` William Lee Irwin III
  2007-06-09 14:55   ` Andrea Arcangeli
  16 siblings, 1 reply; 77+ messages in thread
From: William Lee Irwin III @ 2007-06-08 21:26 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Fri, Jun 08, 2007 at 10:02:58PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> this is a set of fixes done in the context of a quite evil workload reading
> from nfs large files with big read buffers in parallel from many tasks at
> the same time until the system goes oom. Mostly all of these fixes seems to be
> required to fix the customer workload on top of an older sles kernel. The
> forward port of the fixes has been already tested successfully on similar evil
> workloads.
> mainline vanilla running a somewhat simulated workload:
[...]

Interesting. This seems to demonstrate a need for file IO to handle
fatal signals, beyond just people wanting faster responses to kill -9.
Perhaps it's the case that fatal signals should always be handled, and
there should be no waiting primitives excluding them. __GFP_NOFAIL is
also "interesting."


-- wli

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
@ 2007-06-08 21:48   ` Christoph Lameter
  2007-06-09  1:59     ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-08 21:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Fri, 8 Jun 2007, Andrea Arcangeli wrote:

> There's no point in trying to free memory if we're oom.

OOMs can occur because we are in a cpuset or have a memory policy that 
restricts the allocations. So I guess that OOMness is a per node property 
and not a global one.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't go away
  2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
@ 2007-06-08 21:57   ` Christoph Lameter
  0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-06-08 21:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Fri, 8 Jun 2007, Andrea Arcangeli wrote:

> @@ -276,13 +272,16 @@ static void __oom_kill_task(struct task_
>  	if (verbose)
>  		printk(KERN_ERR "Killed process %d (%s)\n", p->pid, p->comm);
>  
> +	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
> +		last_tif_memdie_jiffies = jiffies;
> +		set_bit(0, &VM_is_OOM);
> +	}
>  	/*

You cannot set VM_is_OOM here since __oom_kill_task can be called for
a process that has constrained allocations.

With this patch a user can cause an OOM by restricting access to a single
node using MPOL_BIND. Then VM_is_OOM will be set despite lots of
available memory elsewhere.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-08 21:48   ` Christoph Lameter
@ 2007-06-09  1:59     ` Andrea Arcangeli
  2007-06-09  3:01       ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09  1:59 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Fri, Jun 08, 2007 at 02:48:15PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Andrea Arcangeli wrote:
> 
> > There's no point in trying to free memory if we're oom.
> 
> OOMs can occur because we are in a cpuset or have a memory policy that 
> restricts the allocations. So I guess that OOMness is a per node property 
> and not a global one.

I'm sorry to inform you that the oom killing in current mainline has
always been a global event not a per-node one, regardless of the fixes
I just posted.

	if (test_tsk_thread_flag(p, TIF_MEMDIE))
		return ERR_PTR(-1UL);
	[..]
	if (PTR_ERR(p) == -1UL)
		goto out;

Best would be for you to send me more changes at the end of the
patchbomb so that, for the first time _ever_, the oom will become
a per-node event and not a global one anymore.

That said, it's not entirely obvious to me that it makes any sense to
disrupt functionality instead of just running slower but safely (I
would generally prefer to printk a warning instead of killing a task if
we have to override the restriction on the memory policy). But that's
your call, I'm fine either way...

Thanks!

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-09  1:59     ` Andrea Arcangeli
@ 2007-06-09  3:01       ` Christoph Lameter
  2007-06-09 14:05         ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-09  3:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Sat, 9 Jun 2007, Andrea Arcangeli wrote:

> I'm sorry to inform you that the oom killing in current mainline has
> always been a global event not a per-node one, regardless of the fixes
> I just posted.

Wrong. The oom killing is a local event if we are in a constrained
allocation. The allocating task is killed, not a random task. That call to
kill the allocating task should not set any global flags.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04 of 16] serialize oom killer
  2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli
@ 2007-06-09  6:43   ` Peter Zijlstra
  2007-06-09 15:27     ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2007-06-09  6:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Fri, 2007-06-08 at 22:03 +0200, Andrea Arcangeli wrote:
> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1181332960 -7200
> # Node ID baa866fedc79cb333b90004da2730715c145f1d5
> # Parent  532a5f712848ee75d827bfe233b9364a709e1fc1
> serialize oom killer
> 
> It's risky and useless to run two oom killers in parallel, let's serialize it to
> reduce the probability of spurious oom-killage.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -400,12 +400,15 @@ void out_of_memory(struct zonelist *zone
>  	unsigned long points = 0;
>  	unsigned long freed = 0;
>  	int constraint;
> +	static DECLARE_MUTEX(OOM_lock);

I thought we deprecated that construct in favour of DEFINE_MUTEX. Also,
putting it in a function like so is a little icky IMHO.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-09  3:01       ` Christoph Lameter
@ 2007-06-09 14:05         ` Andrea Arcangeli
  2007-06-09 14:38           ` Andrea Arcangeli
  2007-06-11 16:04           ` Christoph Lameter
  0 siblings, 2 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09 14:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Fri, Jun 08, 2007 at 08:01:58PM -0700, Christoph Lameter wrote:
> On Sat, 9 Jun 2007, Andrea Arcangeli wrote:
> 
> > I'm sorry to inform you that the oom killing in current mainline has
> > always been a global event not a per-node one, regardless of the fixes
> > I just posted.
> 
> Wrong. The oom killling is a local event if we are in a constrained 
> allocation. The allocating task is killed not a random task. That call to 
> kill the allocating task should not set any global flags.

I just showed the global flag that is being checked. TIF_MEMDIE
affects the whole system, not just your node-constrained allocating
task. If your local constrained task fails to exit because it's
running in the nfs path that loops forever even if NULL is returned
from alloc_pages, it will deadlock the whole system if later a regular
oom happens (alloc_pages isn't guaranteed to be called by a page fault
where we know do_exit is guaranteed to be called if a sigkill is
pending). This is just an example.

Admittedly my fixes made things worse for your "local" oom killing, but
your code was only apparently "local" because TIF_MEMDIE is a _global_
flag in the mainline kernel. So again, I'm very willing to improve the
local oom killing, so that it will really become a local event for the
first time ever. In fact with my fixes applied the whole system will
stop waiting for the TIF_MEMDIE flag to go away, so it'll be much
easier to really make the global oom killing independent from the
local one. I didn't look into the details of the local oom killing yet
(exactly because it wasn't so local in the first place) but it may be
enough to set VM_is_OOM only for tasks that are not being locally
killed and then those new changes will automatically prevent
TIF_MEMDIE being set on a local-oom to affect the global-oom event.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-09 14:05         ` Andrea Arcangeli
@ 2007-06-09 14:38           ` Andrea Arcangeli
  2007-06-11 16:07             ` Christoph Lameter
  2007-06-11 16:04           ` Christoph Lameter
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09 14:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On a side note about the current way you select the task to kill if a
constrained alloc failure triggers, I think it would have been better
if you simply extended the oom-selector by filtering tasks based on
current->mems_allowed. Now, I agree the current badness is quite
bad; with rss instead of the virtual space it works a bit better
at least. But the whole point is that if you integrate the cpuset task
filtering into the oom-selector algorithm, then once we fix the badness
algorithm to actually do something more meaningful than checking
static values, you'll get the better algorithm working for your
local-oom killing too. That is, if you really care about the huge-numa
niche and want node-partitioning to work as if this were a
virtualized environment. If you just have to kill something to release
memory, killing the current task is obviously always the safest
choice, so as long as your customers are ok with it I'm certainly fine
with the current approach too.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 00 of 16] OOM related fixes
  2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III
@ 2007-06-09 14:55   ` Andrea Arcangeli
  2007-06-12  8:58     ` Petr Tesarik
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09 14:55 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-mm, Petr Tesarik

Hi Wil,

On Fri, Jun 08, 2007 at 02:26:10PM -0700, William Lee Irwin III wrote:
> Interesting. This seems to demonstrate a need for file IO to handle
> fatal signals, beyond just people wanting faster responses to kill -9.
> Perhaps it's the case that fatal signals should always be handled, and
> there should be no waiting primitives excluding them. __GFP_NOFAIL is
> also "interesting."

Clearly the sooner we respond to a SIGKILL the better. We tried to
catch the two critical points to solve the evil read(huge)->oom. BTW,
the first suggestion that we had to also break out of read to make
progress substantially quicker, was from Petr so I'm cc'ing him. I'm
unsure what else of more generic we could do to solve more of those
troubles at the same time without having to pollute the code with
sigkill checks. For example we're not yet covering the o-direct paths
but I did the minimal changes to resolve the current workload and that
used buffered io of course ;). BTW, I could have checked the
TIF_MEMDIE instead of seeing if sigkill was pending, but since I had
to check the task structure anyway, I preferred to check for the
sigkill so that kill -9 will now work for the first time against a
large read/write syscall, besides allowing the TIF_MEMDIE task to exit
in reasonable time without triggering the deadlock detection in the
later patches.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04 of 16] serialize oom killer
  2007-06-09  6:43   ` Peter Zijlstra
@ 2007-06-09 15:27     ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-09 15:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-mm

On Sat, Jun 09, 2007 at 08:43:47AM +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-08 at 22:03 +0200, Andrea Arcangeli wrote:
> > +	static DECLARE_MUTEX(OOM_lock);
> 
> I thought we depricated that construct in favour of DEFINE_MUTEX. Also,

Ok, so it should be changed to DEFINE_MUTEX. I have to trust you on
this because there's not a sign of warning in asm-i386/semaphore.h
that DECLARE_MUTEX has been deprecated and tons of code is still using
it in the current kernel. I couldn't imagine that somebody duplicated
it somewhere else for whatever reason without removing
DECLARE_MUTEX. It's not like we have to keep deprecated and redundant
interfaces in the kernel for no good reason, especially if `sed` can
fix it without human intervention. Let's say it's a low priority to
rename it; if I have to generate a new diff, I'd probably prefer to
generate one that drops DECLARE_MUTEX all over the other places too.

> putting it in a function like so is a little icky IMHO.

On this I disagree: the whole point of static/private variables is to
decrease visibility where it's unnecessary. A function-local static
variable is even less visible, so it's a good thing and it helps
self-document the code. So I very much like to keep it there; keeping
the scope strict improves readability (you immediately know that no
other code could ever try to acquire that lock).
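
For reference, a minimal sketch of the rename Peter suggests, keeping
the function-local scope argued for above (hypothetical helper name and
body, illustrative only, not the posted patch):

	#include <linux/mutex.h>

	/* hypothetical helper, only to show the construct */
	static void oom_serialized_section(void)
	{
		/* DEFINE_MUTEX replaces the deprecated DECLARE_MUTEX; keeping
		 * it function-local static preserves the narrow visibility. */
		static DEFINE_MUTEX(oom_lock);

		mutex_lock(&oom_lock);
		/* ... run the oom killer single-threaded ... */
		mutex_unlock(&oom_lock);
	}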

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 14 of 16] oom select should only take rss into account
  2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli
@ 2007-06-10 17:17   ` Rik van Riel
  2007-06-10 17:30     ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-10 17:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

Andrea Arcangeli wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -66,7 +66,7 @@ unsigned long badness(struct task_struct
>  	/*
>  	 * The memory size of the process is the basis for the badness.
>  	 */
> -	points = mm->total_vm;
> +	points = get_mm_rss(mm);

Makes sense.  Originally it used total_vm so it could also
select tasks that use up lots of swap, but I guess that in
almost all the cases the preferred OOM task to kill is also
using a lot of RAM.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed
  2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2007-06-10 17:20   ` Rik van Riel
  2007-06-10 17:32     ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-10 17:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Larry Woodman

Andrea Arcangeli wrote:

> No need to wipe out a huge chunk of the cache.

I've seen recent upstream kernels free up to 75% of memory
on my test system, when pushed hard enough.

It is not hard to get hundreds of tasks into the pageout
code simultaneously, all starting out at priority 12 and
not freeing anything until they all get to much lower
priorities.

A workload that is dominated by anonymous memory will
trigger this.  All anonymous memory starts out on the
active list and tasks will not even try to shrink the
inactive list because nr_inactive >> priority is 0.

This patch is a step in the right direction.

However, I believe that your [PATCH 01 of 16] is a
step in the wrong direction for these workloads...

> Signed-off-by: Andrea Arcangeli <andrea@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -938,6 +938,8 @@ static unsigned long shrink_zone(int pri
>  			nr_inactive -= nr_to_scan;
>  			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
>  								sc);
> +			if (nr_reclaimed >= sc->swap_cluster_max)
> +				break;
>  		}
>  	}

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 14 of 16] oom select should only take rss into account
  2007-06-10 17:17   ` Rik van Riel
@ 2007-06-10 17:30     ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-10 17:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm

On Sun, Jun 10, 2007 at 01:17:13PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
> 
> >diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >--- a/mm/oom_kill.c
> >+++ b/mm/oom_kill.c
> >@@ -66,7 +66,7 @@ unsigned long badness(struct task_struct
> > 	/*
> > 	 * The memory size of the process is the basis for the badness.
> > 	 */
> >-	points = mm->total_vm;
> >+	points = get_mm_rss(mm);
> 
> Makes sense.  Originally it used total_vm so it could also
> select tasks that use up lots of swap, but I guess that in
> almost all the cases the preferred OOM task to kill is also
> using a lot of RAM.

Agreed.

> Acked-by: Rik van Riel <riel@redhat.com>

Thanks for the Ack.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed
  2007-06-10 17:20   ` Rik van Riel
@ 2007-06-10 17:32     ` Andrea Arcangeli
  2007-06-10 17:52       ` Rik van Riel
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-10 17:32 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, Larry Woodman

On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote:
> code simultaneously, all starting out at priority 12 and
> not freeing anything until they all get to much lower
> priorities.

BTW, this reminds me that I've been wondering if 2**12 is too small a
fraction of the lru to start the scan with.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
@ 2007-06-10 17:36   ` Rik van Riel
  2007-06-10 18:17     ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-10 17:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

Andrea Arcangeli wrote:

> -	else
> +	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
> +	if (nr_inactive < sc->swap_cluster_max)
>  		nr_inactive = 0;

This is a problem.

On workloads with lots of anonymous memory, for example
running a very large JVM or simply stressing the system
with AIM7, the inactive list can be very small.

If dozens (or even hundreds) of tasks get into the
pageout code simultaneously, they will all spend a lot
of time moving pages from the active to the inactive
list, but they will not even try to free any of the
(few) inactive pages the system has!

We have observed systems in stress tests that spent
well over 10 minutes in shrink_active_list before
the first call to shrink_inactive_list was made.

Your code looks like it could exacerbate that situation,
by not having zone->nr_scan_inactive increment between
calls.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02 of 16] avoid oom deadlock in nfs_create_request
  2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2007-06-10 17:38   ` Rik van Riel
  2007-06-10 18:27     ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-10 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

Andrea Arcangeli wrote:

> When sigkill is pending after the oom killer set TIF_MEMDIE, the task
> must go away or the VM will malfunction.

However, if the sigkill is pending against ANOTHER task,
this patch looks like it could introduce an IO error
where the system would recover fine before.

Tasks that do not have a pending SIGKILL should retry
the allocation, shouldn't they?

> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte
>  	struct nfs_server *server = NFS_SERVER(inode);
>  	struct nfs_page		*req;
>  
> -	for (;;) {
> -		/* try to allocate the request struct */
> -		req = nfs_page_alloc();
> -		if (req != NULL)
> -			break;
> -
> -		if (signalled() && (server->flags & NFS_MOUNT_INTR))
> -			return ERR_PTR(-ERESTARTSYS);
> -		yield();
> -	}
> +	/* try to allocate the request struct */
> +	req = nfs_page_alloc();
> +	if (unlikely(!req)) {
> +		/*
> +		 * -ENOMEM will be returned only when TIF_MEMDIE is set
> +		 * so userland shouldn't risk to get confused by a new
> +		 * unhandled ENOMEM errno.
> +		 */
> +		WARN_ON(!test_thread_flag(TIF_MEMDIE));
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	if (signalled() && (server->flags & NFS_MOUNT_INTR))
> +		return ERR_PTR(-ERESTARTSYS);
>  
>  	/* Initialize the request struct. Initially, we assume a
>  	 * long write-back delay. This will be adjusted in
> 

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed
  2007-06-10 17:32     ` Andrea Arcangeli
@ 2007-06-10 17:52       ` Rik van Riel
  2007-06-11 16:23         ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-10 17:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Larry Woodman

Andrea Arcangeli wrote:
> On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote:
>> code simultaneously, all starting out at priority 12 and
>> not freeing anything until they all get to much lower
>> priorities.
> 
> BTW, this reminds me that I've been wondering if 2**12 is a too small
> fraction of the lru to start the scan with.

If the system has 1 TB of RAM, it's probably too big
of a fraction :)

We need something smarter.
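
(Purely illustrative, with a hypothetical helper, to put numbers on the
1 TB remark above, assuming 4 KB pages:)

	static unsigned long long example_1tb_scan_target(void)
	{
		/* 1 TB worth of 4 KB pages on one LRU (hypothetical) */
		unsigned long long pages = (1ULL << 40) / 4096;

		/* initial ">> 12" pass: 65536 pages, i.e. ~256 MB per call */
		return pages >> 12;
	}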

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-10 17:36   ` Rik van Riel
@ 2007-06-10 18:17     ` Andrea Arcangeli
  2007-06-11 14:58       ` Rik van Riel
  2007-06-26 17:08       ` Rik van Riel
  0 siblings, 2 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-10 18:17 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm

On Sun, Jun 10, 2007 at 01:36:46PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
> 
> >-	else
> >+	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
> >+	if (nr_inactive < sc->swap_cluster_max)
> > 		nr_inactive = 0;
> 
> This is a problem.
> 
> On workloads with lots of anonymous memory, for example
> running a very large JVM or simply stressing the system
> with AIM7, the inactive list can be very small.
> 
> If dozens (or even hundreds) of tasks get into the
> pageout code simultaneously, they will all spend a lot
> of time moving pages from the active to the inactive
> list, but they will not even try to free any of the
> (few) inactive pages the system has!
> 
> We have observed systems in stress tests that spent
> well over 10 minutes in shrink_active_list before
> the first call to shrink_inactive_list was made.
> 
> Your code looks like it could exacerbate that situation,
> by not having zone->nr_scan_inactive increment between
> calls.

If all tasks spend 10 minutes in shrink_active_list before the first
call to shrink_inactive_list, that could mean you hit the race that I'm
just trying to fix with this very patch (i.e. nr_*active going
totally huge because of the race triggering, and trashing over the few
pages left in the *active_list until the artificially boosted
nr_*active finally goes down to zero in all tasks that read it at the
unlucky time when it got huge). So my patch may actually fix your
situation completely if your trouble was nr_scan_active becoming huge
for no good reason, just because many tasks entered the VM at the same
time on big-SMP systems. Did you monitor the real sizes of the active
lists during those 10 min and compare them to the nr_active stored in
the stack?

Normally, if the highest priority pass only calls into
shrink_active_list, that's because the two lists need rebalancing. But
I fail to see how it could ever take 10 minutes for the first
shrink_inactive_list to trigger with my patch applied, while if it
happens in current vanilla that could be the race triggering, or
anyway something unrelated going wrong in the VM.

Overall this code seems quite flaky in its current "racy" form, so I
doubt it can be allowed to live as-is. In fact even if we fix the race
with a slow shared lock in a fast path, or if we only make sure not to
exacerbate your situation with something as simple and lock-less
as "nr_active = min(sizeof_active_list, nr_scan_active)", I think it
would still be wrong to do more work in the current tasks if we have
other tasks helping us at the same time. We should do nothing more,
nothing less. So I think if we want those counters to avoid restarting
from zero at each priority step (which I understand is your worry),
those counters should be in the stack, task-local. That will still
take into account the previously not scanned "nr_inactive" value.
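
A purely illustrative sketch of that task-local idea (hypothetical names
and parameters, not a posted patch): the unscanned remainder travels with
the caller rather than with the zone, so concurrent reclaimers cannot
inflate each other's scan targets.

	static unsigned long scan_target(unsigned long lru_size, int priority,
					 unsigned long *carry, unsigned long batch)
	{
		/* carry: this task's not-yet-scanned remainder, kept on its
		 * own stack across priority levels */
		unsigned long target = *carry + (lru_size >> priority);

		if (target < batch) {
			*carry = target;	/* remember it for the next priority */
			return 0;
		}
		*carry = 0;
		return target;
	}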

Not sure what's best. I have the feeling that introducing task-local
*nr_active/*nr_inactive counters shared by all priority steps won't
move the VM needle much, but I sure wouldn't be against it. It will
change the balancing to be fairer, but in practice I don't expect
huge differences: there are only 12 steps anyway, and the inactive
list should be shrunk very quickly even if the active list is huge.

I'm only generally against the current per-zone, global and racy
approach without limits, which can potentially exacerbate your
situation when nr_active becomes very huge despite the active list
being very small.

Thanks.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02 of 16] avoid oom deadlock in nfs_create_request
  2007-06-10 17:38   ` Rik van Riel
@ 2007-06-10 18:27     ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-10 18:27 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm

On Sun, Jun 10, 2007 at 01:38:49PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
> 
> >When sigkill is pending after the oom killer set TIF_MEMDIE, the task
> >must go away or the VM will malfunction.
> 
> However, if the sigkill is pending against ANOTHER task,
> this patch looks like it could introduce an IO error
> where the system would recover fine before.

The error being returned would be -ENOMEM. But even that should not be
returned, because do_exit will run before userland runs again. When I
mentioned this to Neil he didn't seem to object that do_exit will be
called first, so I hope we didn't get it wrong.

The only risk would be if we set TIF_MEMDIE but kill the task with
SIGTERM; then the I/O error could reach userland if the user caught
the sigterm signal in userland.

I didn't add the warn-on for sigkill, because even if we decide to
send sigterm first, in theory it wouldn't be a kernel issue if we
correctly return -ENOMEM to userland if that is the task that must
exit (we don't support a graceful exit path today, perhaps we never
will). But clearly we don't know if all userland code is capable of
coping with a -ENOMEM, so for now we don't have to worry thanks to the
sigkill.

> Tasks that do not have a pending SIGKILL should retry
> the allocation, shouldn't they?

All tasks not having TIF_MEMDIE set (and, currently, sigkill pending as
well) should retry, yes.
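
A minimal sketch of that retry behaviour, reusing only helpers that
already appear in the posted patch (illustrative, not the patch itself;
the real patches also check for a pending SIGKILL):

	for (;;) {
		/* keep retrying unless this task was selected by the oom killer */
		req = nfs_page_alloc();
		if (req != NULL)
			break;
		if (test_thread_flag(TIF_MEMDIE))
			return ERR_PTR(-ENOMEM);
		if (signalled() && (server->flags & NFS_MOUNT_INTR))
			return ERR_PTR(-ERESTARTSYS);
		yield();
	}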

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-10 18:17     ` Andrea Arcangeli
@ 2007-06-11 14:58       ` Rik van Riel
  2007-06-26 17:08       ` Rik van Riel
  1 sibling, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-06-11 14:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

Andrea Arcangeli wrote:
> On Sun, Jun 10, 2007 at 01:36:46PM -0400, Rik van Riel wrote:
>> Andrea Arcangeli wrote:
>>
>>> -	else
>>> +	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
>>> +	if (nr_inactive < sc->swap_cluster_max)
>>> 		nr_inactive = 0;
>> This is a problem.
>>
>> On workloads with lots of anonymous memory, for example
>> running a very large JVM or simply stressing the system
>> with AIM7, the inactive list can be very small.
>>
>> If dozens (or even hundreds) of tasks get into the
>> pageout code simultaneously, they will all spend a lot
>> of time moving pages from the active to the inactive
>> list, but they will not even try to free any of the
>> (few) inactive pages the system has!
>>
>> We have observed systems in stress tests that spent
>> well over 10 minutes in shrink_active_list before
>> the first call to shrink_inactive_list was made.
>>
>> Your code looks like it could exacerbate that situation,
>> by not having zone->nr_scan_inactive increment between
>> calls.
> 
> If all tasks spend 10 minutes in shrink_active_list before the first
> call to shrink_inactive_list that could mean you hit the race that I'm
> just trying to fix with this very patch. (i.e. nr_*active going
> totally huge because of the race triggering,

Nope.  In this case it spends its time in shrink_active_list
because the active list is 99% of memory (several GB) while
the inactive list is so small that nr_inactive_pages >> priority
is zero.

> Normally if the highest priority passes only calls into
> shrink_active_list that's because the two lists needs rebalancing. But
> I fail to see how it could ever take 10min for the first
> shrink_inactive_list to trigger with my patch applied, while if it
> happens in current vanilla that could be the race triggering, or
> anyway something unrelated is going wrong in the VM.

Yeah, I have no real objection to your patch, but was
just pointing out that it does not fix the big problem
with this code.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-09 14:05         ` Andrea Arcangeli
  2007-06-09 14:38           ` Andrea Arcangeli
@ 2007-06-11 16:04           ` Christoph Lameter
  1 sibling, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 16:04 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Sat, 9 Jun 2007, Andrea Arcangeli wrote:

> I just showed the global flag that is being checked. TIF_MEMDIE
> affects the whole system, not just your node-constrained allocating

TIF_MEMDIE affects the task that attempted to perform a constrained
allocation. The effects are global for that task but they are not as
severe as setting a global OOM flag!

> Admittedly my fixes made things worse for your "local" oom killing, but
> your code was only apparently "local" because TIF_MEMDIE is a _global_
> flag in the mainline kernel. So again, I'm very willing to improve the

TIF_MEMDIE is confined to a process.

> local one. I didn't look into the details of the local oom killing yet
> (exactly because it wasn't so local in the first place) but it may be
> enough to set VM_is_OOM only for tasks that are not being locally
> killed and then those new changes will automatically prevent
> TIF_MEMDIE being set on a local-oom to affect the global-oom event.

TIF_MEMDIE must be set in order for the task to die properly even if it's a
constrained allocation, because TIF_MEMDIE relaxes the constraints so that
the task can terminate.
 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-09 14:38           ` Andrea Arcangeli
@ 2007-06-11 16:07             ` Christoph Lameter
  2007-06-11 16:50               ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 16:07 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Sat, 9 Jun 2007, Andrea Arcangeli wrote:

> On a side note about the current way you select the task to kill if a
> constrained alloc failure triggers, I think it would have been better
> if you simply extended the oom-selector by filtering tasks in function
> of the current->mems_allowed. Now I agree the current badness is quite

Filtering tasks is a very expensive operation on huge systems. We have had 
cases where it took an hour or so for the OOM to complete. OOM usually 
occurs under heavy processing loads, which makes the taking of global locks
quite expensive.

> bad, now with rss instead of the virtual space, it works a bit better
> at least, but the whole point is that if you integrate the cpuset task
> filtering in the oom-selector algorithm, then once we fix the badness
> algorithm to actually do something more meaningful than to check
> static values, you'll get the better algorithm working for your
> local-oom killing too. This if you really care about the huge-numa
> niche to get node-partitioning working really like if this was a
> virtualized environment. If you just have kill something to release
> memory, killing the current task is always the safest choice
> obviously, so as your customers are ok with it I'm certainly fine with
> the current approach too.

The "kill-the-current-process" approach is most effective in hitting the 
process that is allocating the most. And as far as I can tell it's easiest
to understand for our customer.
 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed
  2007-06-10 17:52       ` Rik van Riel
@ 2007-06-11 16:23         ` Christoph Lameter
  2007-06-11 16:57           ` Rik van Riel
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 16:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Larry Woodman

On Sun, 10 Jun 2007, Rik van Riel wrote:

> Andrea Arcangeli wrote:
> > On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote:
> > > code simultaneously, all starting out at priority 12 and
> > > not freeing anything until they all get to much lower
> > > priorities.
> > 
> > BTW, this reminds me that I've been wondering if 2**12 is a too small
> > fraction of the lru to start the scan with.
> 
> If the system has 1 TB of RAM, it's probably too big
> of a fraction :)
> 
> We need something smarter.

Well this value depends on a node's memory, not on the system's
total memory. So I think we are fine. 1TB systems (at least ours) are
comprised of nodes with 4GB/8GB/16GB of memory.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 16:07             ` Christoph Lameter
@ 2007-06-11 16:50               ` Andrea Arcangeli
  2007-06-11 16:57                 ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-11 16:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Mon, Jun 11, 2007 at 09:07:59AM -0700, Christoph Lameter wrote:
> Filtering tasks is a very expensive operation on huge systems. We have had 

Come on, oom_kill.c only runs at oom time, after the huge complex
processing has figured out it's time to call into oom_kill.c; how can
you care about the performance of oom_kill.c?  Apparently some folks
prefer to panic when oom triggers, go figure...

> cases where it took an hour or so for the OOM to complete. OOM usually 
> occurs under heavy processing loads which makes the taking of global locks 
> quite expensive.

Since you mean that a _global_ OOM took one hour (you just used it as
the slow case for comparison; the local-oom is supposed to be the
fast one instead), I'd appreciate it if you could try again with all my
fixes applied and see if the time to recover from the global oom is
reduced (which is the whole objective of most of the fixes I've just
posted).

In general whatever you do inside oom_kill.c has nothing to do with
the "expensive operations" (the expensive operations are in fact halted
with my fixes).

In turn, killing the current task so that oom_kill.c is faster is
quite a dubious argument.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15 of 16] limit reclaim if enough pages have been freed
  2007-06-11 16:23         ` Christoph Lameter
@ 2007-06-11 16:57           ` Rik van Riel
  0 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-06-11 16:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrea Arcangeli, linux-mm, Larry Woodman

Christoph Lameter wrote:
> On Sun, 10 Jun 2007, Rik van Riel wrote:
> 
>> Andrea Arcangeli wrote:
>>> On Sun, Jun 10, 2007 at 01:20:50PM -0400, Rik van Riel wrote:
>>>> code simultaneously, all starting out at priority 12 and
>>>> not freeing anything until they all get to much lower
>>>> priorities.
>>> BTW, this reminds me that I've been wondering if 2**12 is a too small
>>> fraction of the lru to start the scan with.
>> If the system has 1 TB of RAM, it's probably too big
>> of a fraction :)
>>
>> We need something smarter.
> 
> Well this value is depending on a nodes memory not on the systems 
> total memory. So I think we are fine. 1TB systems (at least ours) are 
> comprised of nodes with 4GB/8GB/16GB of memory.

Yours are fine, because currently the very large system
customers tend to run fine-tuned workloads.

We are seeing some other users throwing random workloads
at systems with 256GB of RAM in a single zone.  General
purpose computing is moving up, VM explosions are becoming
more spectacular :)

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 16:50               ` Andrea Arcangeli
@ 2007-06-11 16:57                 ` Christoph Lameter
  2007-06-11 17:51                   ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 16:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Mon, 11 Jun 2007, Andrea Arcangeli wrote:

> On Mon, Jun 11, 2007 at 09:07:59AM -0700, Christoph Lameter wrote:
> > Filtering tasks is a very expensive operation on huge systems. We have had 
> 
> Come on, oom_kill.c only happens at oom time, after the huge complex
> processing has figured out it's time to call into oom_kill.c, how can
> you care about the performance of oom_kill.c?  Apparently some folks
> prefer to panic when oom triggers go figure...

It's pretty bad if a large system sits for hours just because it cannot
finish its OOM processing. We have reports of that taking 4 hours!

> In turn killing the current task so that oom_kill.c is faster, is
> quite a dubious argument.

It avoids repeated scans over a super-sized tasklist with heavy lock
contention. 4 loops for every OOM kill! If a number of processes get
OOM killed, then it will take hours to sort out the lock contention.

Want this as a SUSE bug?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 16:57                 ` Christoph Lameter
@ 2007-06-11 17:51                   ` Andrea Arcangeli
  2007-06-11 17:56                     ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-11 17:51 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Mon, Jun 11, 2007 at 09:57:59AM -0700, Christoph Lameter wrote:
> On Mon, 11 Jun 2007, Andrea Arcangeli wrote:
> 
> > On Mon, Jun 11, 2007 at 09:07:59AM -0700, Christoph Lameter wrote:
> > > Filtering tasks is a very expensive operation on huge systems. We have had 
> > 
> > Come on, oom_kill.c only happens at oom time, after the huge complex
> > processing has figured out it's time to call into oom_kill.c, how can
> > you care about the performance of oom_kill.c?  Apparently some folks
> > prefer to panic when oom triggers go figure...
> 
> Its pretty bad if a large system sits for hours just because it cannot 
> finish its OOM processing. We have reports of that taking 4 hours!

Which is why I posted these fixes, so it will hopefully take much less
than 4 hours. Even normal production systems take far too long
today. Most of these fixes are meant to reduce the complexity involved
in detecting when the system is oom (starting from number 01). Keep in
mind the whole 4 hours are spent _outside_ oom_kill.c.

> It avoids repeated scans over a super sized tasklist with heavy lock 
> contention. 4 loops for every OOM kill! If a number of processes will be 

Once the tasklist_lock has been taken, what else is going to trash
inside oom_kill.c?

> OOM killed then it will take hours to sort out the lock contention.

Did you measure it or is this just your imagination? I don't buy your
hypothetical "several hours spent in oom_kill.c" numbers. How long
does "ls /proc" take? Can you run top at all?

> Want this as a a SUSE bug?

Feel free to file a SUSE bugreport; I hope you will back your claim
with some real profiling data so we can check if this can be fixed
in software or it's the hardware to blame (in which case we need a
CONFIG_SLOW_NUMA, since other hardware implementations may prefer to
use the oom-selector during local-oom killing too and not only during
the global ones).

Thanks.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 17:51                   ` Andrea Arcangeli
@ 2007-06-11 17:56                     ` Christoph Lameter
  2007-06-11 18:22                       ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 17:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Mon, 11 Jun 2007, Andrea Arcangeli wrote:

> Did you measure it or this is just your imagination? I don't buy your
> hypothetical "several hours spent in oom_kill.c" numbers. How long
> does "ls /proc" takes? Can your run top at all?

These are customer reports. One took 4 hours and another 2 hours. I can
certainly get more reports if I ask them for more details. I will get this
on your SUSE radar.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 17:56                     ` Christoph Lameter
@ 2007-06-11 18:22                       ` Andrea Arcangeli
  2007-06-11 18:39                         ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-11 18:22 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Mon, Jun 11, 2007 at 10:56:56AM -0700, Christoph Lameter wrote:
> On Mon, 11 Jun 2007, Andrea Arcangeli wrote:
> 
> > Did you measure it or this is just your imagination? I don't buy your
> > hypothetical "several hours spent in oom_kill.c" numbers. How long
> > does "ls /proc" takes? Can your run top at all?
> 
> These are customer reports. 4 hours one and another 2 hours. I can 

How long does "ls /proc" take? Can you run top at all on such a
system? (I mean before it reaches the oom point; afterwards it'll hang
for those 4 hours with the mainline kernel, I know this and that's why
I worked to fix it and posted 18 patches so far about it.)

> certainly get more reports if I ask them for more details. I will get this 
> on your SUSE radar.

If it takes 4 hours for the function out_of_memory to return, please
report it. If instead, as I start to suspect, you're going to show me
the function out_of_memory called one million times and taking a few
seconds for each invocation, please test all my fixes before
reporting; there's a reason I made those changes...

Back to the local-oom: if out_of_memory takes a couple of seconds at
most as I expect (it'll be the same order as ls /proc; actually ls
/proc will be a lot slower), killing the current task in the local-oom
as a performance optimization remains a very dubious argument.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 18:22                       ` Andrea Arcangeli
@ 2007-06-11 18:39                         ` Christoph Lameter
  2007-06-11 18:58                           ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 18:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Mon, 11 Jun 2007, Andrea Arcangeli wrote:

> > These are customer reports. 4 hours one and another 2 hours. I can 
> 
> How long does "ls /proc" take? Can you run top at all on such a
> system (I mean before it reaches the oom point, then it'll hang for
> those 4 hours with the mainline kernel, I know this and that's why I
> worked to fix it and posted 18 patches so far about it).

These are big systems and it would take some time to reproduce these 
issues. Thanks for your work. I'd really like to see improvements there. 
If you take care of not worsening the local kill path then I am okay with 
the rest.
 
> > certainly get more reports if I ask them for more details. I will get this 
> > on your SUSE radar.
> 
> If it takes 4 hours for the function out_of_memory to return, please
> report it. If instead as I start to suspect, you're going to show me
> the function out_of_memory called one million times and taking a few
> seconds for each invocation, please test all my fixes before
> reporting, there's a reason I made those changes...

out_of_memory takes about 5-10 minutes per call (according to one report). An
OOM storm will then take the machine out for 4 hours. The on-site SE can
likely tell you more details in the bugzilla.

Another reporter had been waiting for 2 hours after an oom without any 
messages indicating that a single OOM was processed.

> Back to the local-oom: if out_of_memory takes a couple of seconds at
> most as I expect (it'll be the same order of ls /proc, actually ls
> /proc will be a lot slower), killing the current task in the local-oom
> as a performance optimization remains a very dubious argument.

Killing the local process avoids 4 slow scans over a pretty large
tasklist. But I agree that there may be additional issues lurking
there for large systems.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 18:39                         ` Christoph Lameter
@ 2007-06-11 18:58                           ` Andrea Arcangeli
  2007-06-11 19:25                             ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-11 18:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Mon, Jun 11, 2007 at 11:39:03AM -0700, Christoph Lameter wrote:
> These are big systems and it would take some time to reproduce these 

Sure I understand.

> issues. Thanks for your work. I'd really like to see improvements there. 

I appreciate and hope it already helps for your oom troubles too.

> If you take care of not worsening the local kill path then I am okay with 
> the rest.

The slight regression I introduced for the numa local oom path clearly
needed correction. Let me know if you still see problems after the
incremental patch I posted today of course. I think that should be
enough to correct the local-oom without altering the global-oom. I
tested it on non-numa and it still works fine.

> out_of_memory takes about 5-10 minutes each (according to one report). An 

Even 10 minutes is way beyond what I expected (but with the background
trashing of the mainline kernel, I can imagine it happening).

> OOM storm will then take the machine out for 4 hours. The on site SE can 
> likely tell you more details in the bugzilla.

Ok, then I think you really want to try my patchset for the oom storm,
since at least that one should be gone. When the first oom starts, the
whole VM will stop, no other oom_kill will be called, and even if
they're on their way to call a spurious out_of_memory, the semaphore
trylock will put them back in S state immediately inside
try_to_free_pages. Especially on systems like yours where thrashing
cachelines is practically forbidden, I suspect this could make a
substantial difference, and perhaps then out_of_memory will return in
less than 10 minutes simply because it will practically be running
single threaded.
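
To make it concrete, a minimal sketch of the idea (illustrative names
and structure only, not the actual patch):

#include <asm/semaphore.h>
#include <linux/gfp.h>

/* Illustrative only: one global "oom in progress" semaphore. */
static DECLARE_MUTEX(oom_sem);

static int oom_guarded_reclaim(struct zonelist *zonelist, gfp_t gfp_mask)
{
	if (down_trylock(&oom_sem)) {
		/*
		 * Somebody else already triggered the OOM killer: sleep
		 * (S state) until it releases the semaphore, instead of
		 * thrashing the lists or calling out_of_memory() again.
		 */
		down(&oom_sem);
		up(&oom_sem);
		return 0;	/* retry the allocation */
	}
	out_of_memory(zonelist, gfp_mask, 0);	/* exact signature varies */
	up(&oom_sem);
	return 1;
}

In the real patchset the trylock sits inside try_to_free_pages, as
described above.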

> Another reporter had been waiting for 2 hours after an oom without any 
> messages indicating that a single OOM was processed.

This is the case I'm dealing with more commonly: normally, the more
swap there is, the longer it takes, and that's to be expected. It
should have improved too with the patchset.

Thanks.


* Re: [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-06-11 18:58                           ` Andrea Arcangeli
@ 2007-06-11 19:25                             ` Christoph Lameter
  0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-06-11 19:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Mon, 11 Jun 2007, Andrea Arcangeli wrote:

> This is the case I'm dealing with more commonly: normally, the more
> swap there is, the longer it takes, and that's to be expected. It
> should have improved too with the patchset.

Do you have a SLES10 kernel with these fixes?


* Re: [PATCH 00 of 16] OOM related fixes
  2007-06-09 14:55   ` Andrea Arcangeli
@ 2007-06-12  8:58     ` Petr Tesarik
  0 siblings, 0 replies; 77+ messages in thread
From: Petr Tesarik @ 2007-06-12  8:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: William Lee Irwin III, linux-mm

Andrea Arcangeli wrote:
> Hi Wil,
> 
> On Fri, Jun 08, 2007 at 02:26:10PM -0700, William Lee Irwin III wrote:
>> Interesting. This seems to demonstrate a need for file IO to handle
>> fatal signals, beyond just people wanting faster responses to kill -9.
>> Perhaps it's the case that fatal signals should always be handled, and
>> there should be no waiting primitives excluding them. __GFP_NOFAIL is
>> also "interesting."
> 
> Clearly the sooner we respond to a SIGKILL the better. We tried to
> catch the two critical points to solve the evil read(huge)->oom. BTW,
> the first suggestion that we had to also break out of read to make
> progress substantially quicker, was from Petr so I'm cc'ing him. I'm

Late as always... :((
It's not only about getting it quicker - the loop wouldn't break until
the whole chunk had been read, which couldn't happen until some memory
was freed first; but the memory would only be freed by killing this
task, which wouldn't terminate until everything had been read, and so
on... We obviously need to break the vicious circle somewhere.
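
To make the shape of such a fix concrete, a rough sketch (not Andrea's
actual patch, and the helper name is made up): inside the loop that
copies a large buffered read to userspace one would bail out early
when the task has a pending SIGKILL, so the kill takes effect before
the whole multi-hundred-megabyte request completes.

#include <linux/sched.h>
#include <linux/signal.h>

static inline int large_read_should_abort(void)
{
	/* A fatal signal is pending: give up, return a short read or
	 * -EINTR, and let the task exit and release its memory. */
	return sigismember(&current->pending.signal, SIGKILL) ||
	       sigismember(&current->signal->shared_pending.signal, SIGKILL);
}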

If we want to resolve all such cases we would have to ensure that
delivering a SIGKILL can't fail on OOM conditions, i.e. that SIGKILL can
always be handled without memory allocation. I'm planning to do some
investigations on which places in the kernel are (worst) affected and
then think about ways of fixing them. I don't expect we can fix them
all, or at least not in the first round, but this looks like the only
way to go...

Cheers,
Petr Tesarik

> unsure what else of more generic we could do to solve more of those
> troubles at the same time without having to pollute the code with
> sigkill checks. For example we're not yet covering the o-direct paths
> but I did the minimal changes to resolve the current workload and that
> used buffered io of course ;). BTW, I could have checked the
> TIF_MEMDIE instead of seeing if sigkill was pending, but since I had
> to check the task structure anyway, I preferred to check for the
> sigkill so that kill -9 will now work for the first time against a
> large read/write syscall, besides allowing the TIF_MEMDIE task to exit
> in reasonable time without triggering the deadlock detection in the
> later patches.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-10 18:17     ` Andrea Arcangeli
  2007-06-11 14:58       ` Rik van Riel
@ 2007-06-26 17:08       ` Rik van Riel
  2007-06-26 17:55         ` Andrew Morton
  2007-06-26 20:37         ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
  1 sibling, 2 replies; 77+ messages in thread
From: Rik van Riel @ 2007-06-26 17:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

Andrea Arcangeli wrote:

> If all tasks spend 10 minutes in shrink_active_list before the first
> call to shrink_inactive_list that could mean you hit the race that I'm
> just trying to fix with this very patch. 

I got around to testing it now.  I am using AIM7 since it is
a very anonymous memory heavy workload.

Unfortunately your patch does not fix the problem, but behaves
as I had feared :(

Both the normal kernel and your kernel fall over once memory
pressure gets big enough, but they explode differently and
at different points.

I am running the test on a quad core x86-64 system with 2GB
memory.  I am "zooming in" on the 4000 user range, because
that is where they start to diverge.  I am running aim7 to
cross-over, which is the point at which fewer than 1 jobs/min/user
are being completed.

First vanilla 2.6.22-rc5-git8:

Num     Parent   Child   Child  Jobs per   Jobs/min/  Std_dev  Std_dev  JTI
Forked  Time     SysTime UTime   Minute     Child      Time     Percent
4000    119.97   432.86  47.17   204051.01  51.01      11.52    9.99     90
4100    141.59   517.31  48.92   177215.91  43.22      6.67     4.84     95
4200    154.95   569.16  50.51   165885.77  39.50      5.07     3.35     96
4300    166.24   613.40  51.58   158301.25  36.81      10.59    6.51     93
4400    170.40   628.63  52.72   158028.17  35.92      5.46     3.27     96
4500    188.88   701.84  54.06   145806.86  32.40      6.13     3.31     96
4600    200.37   745.73  55.55   140500.07  30.54      4.98     2.54     97
4700    219.25   819.80  57.01   131192.70  27.91      5.38     2.51     97
4800    219.70   820.36  58.22   133709.60  27.86      5.40     2.52     97
4900    232.45   870.08  59.56   129008.39  26.33      4.65     2.02     97
5105    1704.46  5406.56 64.03   18329.91   3.59       264.38   18.85    81
Crossover achieved
Max Jobs per Minute 204051.01


Now 2.6.22-rc5-git8 with your patches 01/16 and 15/16:
Num     Parent   Child   Child  Jobs per   Jobs/min/  Std_dev  Std_dev  JTI
Forked  Time     SysTime UTime   Minute     Child      Time     Percent
4000    141.51   518.37  47.46   172991.31  43.25      5.20     3.75     96
4100    147.07   539.16  48.91   170612.63  41.61      5.11     3.58     96
4200    155.43   571.36  50.18   165373.48  39.37      5.42     3.58     96
4300    1317.89  4558.95 52.53   19968.28   4.64       219.76   18.42    81
Crossover achieved
Max Jobs per Minute 172991.31

One thing I noticed is that with the vanilla kernel, the lower
numbers of users allowed the system to still run fine, while
with your patches the system seemed to get stuck at ~90% system
time pretty quickly...

-- 
All Rights Reversed


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-26 17:08       ` Rik van Riel
@ 2007-06-26 17:55         ` Andrew Morton
  2007-06-26 19:02           ` Rik van Riel
  2007-06-28 22:44           ` Rik van Riel
  2007-06-26 20:37         ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
  1 sibling, 2 replies; 77+ messages in thread
From: Andrew Morton @ 2007-06-26 17:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Tue, 26 Jun 2007 13:08:57 -0400 Rik van Riel <riel@redhat.com> wrote:

> > If all tasks spend 10 minutes in shrink_active_list before the first
> > call to shrink_inactive_list that could mean you hit the race that I'm
> > just trying to fix with this very patch. 
> 
> I got around to testing it now.  I am using AIM7 since it is
> a very anonymous memory heavy workload.
> 
> Unfortunately your patch does not fix the problem, but behaves
> as I had feared :(
> 
> Both the normal kernel and your kernel fall over once memory
> pressure gets big enough, but they explode differently and
> at different points.
> 
> I am running the test on a quad core x86-64 system with 2GB
> memory.  I am "zooming in" on the 4000 user range, because
> that is where they start to diverge.  I am running aim7 to
> cross-over, which is the point at which fewer than 1 jobs/min/user
> are being completed.

with what command line and config scripts does one run aim7 to
reproduce this?

Where's the system time being spent?

Thanks.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-26 17:55         ` Andrew Morton
@ 2007-06-26 19:02           ` Rik van Riel
  2007-06-28 22:44           ` Rik van Riel
  1 sibling, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-06-26 19:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Tue, 26 Jun 2007 13:08:57 -0400 Rik van Riel <riel@redhat.com> wrote:
> 
>>> If all tasks spend 10 minutes in shrink_active_list before the first
>>> call to shrink_inactive_list that could mean you hit the race that I'm
>>> just trying to fix with this very patch. 
>> I got around to testing it now.  I am using AIM7 since it is
>> a very anonymous memory heavy workload.
>>
>> Unfortunately your patch does not fix the problem, but behaves
>> as I had feared :(
>>
>> Both the normal kernel and your kernel fall over once memory
>> pressure gets big enough, but they explode differently and
>> at different points.
>>
>> I am running the test on a quad core x86-64 system with 2GB
>> memory.  I am "zooming in" on the 4000 user range, because
>> that is where they start to diverge.  I am running aim7 to
>> cross-over, which is the point at which fewer than 1 jobs/min/user
>> are being completed.
> 
> with what command line and config scripts does one run aim7 to
> reproduce this?

reaim -x -i 100 -s 5000

Using the default reaim.config and workfile.shared

> Where's the system time being spent?

I will run the tests again with profiling enabled.

-- 
All Rights Reversed


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-26 17:08       ` Rik van Riel
  2007-06-26 17:55         ` Andrew Morton
@ 2007-06-26 20:37         ` Andrea Arcangeli
  2007-06-26 20:57           ` Rik van Riel
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-26 20:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, Andrew Morton

On Tue, Jun 26, 2007 at 01:08:57PM -0400, Rik van Riel wrote:
> Both the normal kernel and your kernel fall over once memory
> pressure gets big enough, but they explode differently and
> at different points.

Ok, at some point it's normal that they start thrashing. What is
strange is that patch 01 seems to require the VM to do more work and,
in turn, more memory to be free. The only explanation I can come up
with is that the race has the side effect of, on average, reducing the
amount of vm activity for each task instead of increasing it (which in
turn reduces thrashing and the free memory level required before the
workload halts).

Even if it may have a positive effect in practice, I still think the
current racy behavior (randomly overestimating and randomly
underestimating the amount of work each task has to do, depending on
who adds and who reads the zone values first) isn't good.

Perhaps if you change DEF_PRIORITY you'll get closer to current
mainline but without any race. You can try to halve it and see what
happens. If the initial passes fail, it'll start swapping and
performance will go down quickly. So perhaps once we fix the race
we'll have to decrease DEF_PRIORITY to get the same vm tuning.
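
To see why DEF_PRIORITY is the knob here, this is the rough shape of
the scan-size computation (simplified, not the exact mainline or
patched code):

#define DEF_PRIORITY	12

/*
 * Each reclaim pass looks at zone_size >> priority pages; priority
 * walks from DEF_PRIORITY down towards 0 until enough was freed.
 */
static unsigned long pages_to_scan(unsigned long zone_pages, int priority)
{
	return zone_pages >> priority;
}

For a 1,000,000-page zone that is 244 pages at priority 12, 15,625 at
priority 6 and the whole zone at priority 0, so halving DEF_PRIORITY
makes the early passes correspondingly more aggressive.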

It'd also be interesting to see what we get between 3000 and 4000.

Where exactly we get to the halting point (4300 vs 5105) isn't
crucial, otherwise one could win simply by decreasing min_free_kbytes
as well, which clearly shows that "when" we hang isn't the real point
of interest. OTOH I agree the difference between 4300 and 5105 seems
way too big, but if it were between 5000 and 5105 I wouldn't worry too
much (5000 instead of 5105 would result in more memory being free at
the oom point, which isn't a net negative). Hope the benchmark is
repeatable.  This week I've been working on another project but I'll
shortly try to install AIM and reproduce and see what happens by
decreasing DEF_PRIORITY. Thanks for the testing!


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-26 20:37         ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
@ 2007-06-26 20:57           ` Rik van Riel
  2007-06-26 22:21             ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-26 20:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, Andrew Morton

Andrea Arcangeli wrote:
> On Tue, Jun 26, 2007 at 01:08:57PM -0400, Rik van Riel wrote:
>> Both the normal kernel and your kernel fall over once memory
>> pressure gets big enough, but they explode differently and
>> at different points.
> 
> Ok, at some point it's normal that they start thrashing. 

Yes, but I would hope that the system would be disk bound
at that time instead of CPU bound.

There was no swap IO going on yet, the system was just
wasting CPU time in the VM.

> Even if it may have a positive effect in practice, I still think the
> current racy behavior (randomly overestimating and randomly
> underestimating the amount of work each task has to do, depending on
> who adds and who reads the zone values first) isn't good.

Oh, I like your simplification of the code, too.

I was running the test to see if that patch could be
merged without any negative side effects, because I
would have liked to see it.

> Where exactly we get to the halting point (4300 vs 5105) isn't
> crucial, 

However, neither of the two seems to be IO bound
at that point...

> Hope the benchmark is repeatable.  This week
> I've been working on another project but I'll shortly try to install
> AIM and reproduce and see what happens by decreasing
> DEF_PRIORITY. Thanks for the testing!

Not only is the AIM7 test perfectly repeatable, it also
causes the VM to show some of the same behaviour that
customers are seeing in the field with large JVM workloads.

-- 
All Rights Reversed


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-26 20:57           ` Rik van Riel
@ 2007-06-26 22:21             ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-26 22:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, Andrew Morton

On Tue, Jun 26, 2007 at 04:57:20PM -0400, Rik van Riel wrote:
> Yes, but I would hope that the system would be disk bound
> at that time instead of CPU bound.
> 
> There was no swap IO going on yet, the system was just
> wasting CPU time in the VM.

That seems a separate problem: 01 starts wasting cpu sooner, and
that's the regression you discovered, but mainline wastes cpu the same
way too later on. We should do some profiling like Andrew suggested to
see what's going on when it starts thrashing the cpu (perhaps it's
some smp lock? you said you've got only 4 cores, so it must be a
highly contended one if it's really a lock).
 
> Oh, I like your simplification of the code, too.
> 
> I was running the test to see if that patch could be
> merged without any negative side effects, because I
> would have liked to see it.

I see. It's good that you tested this with this workload so we noticed
the regression. At the moment I hope it's only a matter of tuning
DEF_PRIORITY (or similar); it'd be really sad if this had some magic
racy behavior that couldn't be reproduced with a static, non-racy
algorithm.

If nothing else, if we want to stick with this explicit smp race in
the vm core, somebody should at least attempt to document how they can
predict what the race will do at runtime, because to me it seems quite
an unpredictable beast. On average it will probably reach a stable
state, but this stable state will depend on the speed of the cpu
caches and on the number of cpus, on the architecture and on the
assembly generated by gcc, and then the race will trigger more or less
or in a different way...

> However, neither of the two seems to be IO bound
> at that point...

Yes. For now I'd be happy to see the same results for both to
eliminate the regression.

> Not only is the AIM7 test perfectly repeatable, it also
> causes the VM to show some of the same behaviour that
> customers are seeing in the field with large JVM workloads.

Sounds good, thanks!


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-26 17:55         ` Andrew Morton
  2007-06-26 19:02           ` Rik van Riel
@ 2007-06-28 22:44           ` Rik van Riel
  2007-06-28 22:57             ` Andrew Morton
  2007-06-29 13:38             ` Lee Schermerhorn
  1 sibling, 2 replies; 77+ messages in thread
From: Rik van Riel @ 2007-06-28 22:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:

> Where's the system time being spent?

OK, it turns out that there is quite a bit of variability
in where the system spends its time.  I did a number of
reaim runs and averaged the time the system spent in the
top functions.

This is with the Fedora rawhide kernel config, which has
quite a few debugging options enabled.

_raw_spin_lock		32.0%
page_check_address	12.7%
__delay			10.8%
mwait_idle		10.4%
anon_vma_unlink		5.7%
__anon_vma_link		5.3%
lockdep_reset_lock	3.5%
__kmalloc_node_track_caller 2.8%
security_port_sid	1.8%
kfree			1.6%
anon_vma_link		1.2%
page_referenced_one	1.1%

In short, the system is waiting on the anon_vma lock.

I wonder if Lee Schermerhorn's patch to turn that
spinlock into an rwlock would help this workload,
or if we simply should scan fewer pages in the
pageout code.

Andrea, with your VM patches for some reason the
number of users where reaim has its crossover point
is also somewhat variable, between 4200 and 5100
users, with 9 out of 10 runs under 4500 on my system.

A kernel without your patches is not as variable,
but has visibly more unfairness between tasks, as
seen in the reaim "Std_dev" columns.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 22:44           ` Rik van Riel
@ 2007-06-28 22:57             ` Andrew Morton
  2007-06-28 23:04               ` Rik van Riel
  2007-06-29 13:38             ` Lee Schermerhorn
  1 sibling, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2007-06-28 22:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 18:44:56 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> > Where's the system time being spent?
> 
> OK, it turns out that there is quite a bit of variability
> in where the system spends its time.  I did a number of
> reaim runs and averaged the time the system spent in the
> top functions.
> 
> This is with the Fedora rawhide kernel config, which has
> quite a few debugging options enabled.
> 
> _raw_spin_lock		32.0%
> page_check_address	12.7%
> __delay			10.8%
> mwait_idle		10.4%
> anon_vma_unlink		5.7%
> __anon_vma_link		5.3%
> lockdep_reset_lock	3.5%
> __kmalloc_node_track_caller 2.8%
> security_port_sid	1.8%
> kfree			1.6%
> anon_vma_link		1.2%
> page_referenced_one	1.1%
> 
> In short, the system is waiting on the anon_vma lock.

Sigh.  We had a workload (forget which, still unfixed) in which things
would basically melt down in that linear anon_vma walk, walking 10,000 or
more vma's.  I wonder if that's what's happening here?
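
For reference, the walk in question looks roughly like this
(simplified from the 2.6.22-era page_referenced_anon(), details
trimmed):

static int page_referenced_anon_sketch(struct page *page)
{
	unsigned int mapcount = page_mapcount(page);
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	int referenced = 0;

	anon_vma = page_lock_anon_vma(page);	/* takes anon_vma->lock */
	if (!anon_vma)
		return 0;
	/* One shared spinlock, one linear list of every vma mapping the page. */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		referenced += page_referenced_one(page, vma, &mapcount);
		if (!mapcount)
			break;
	}
	page_unlock_anon_vma(anon_vma);
	return referenced;
}

So N reclaimers hitting pages that share one anon_vma all serialise on
that single lock while each doing an O(nr_vmas) walk.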

Also, one thing to watch out for here is a problem with the spinlocks
themselves: the problem wherein the cores in one package keep rattling the
lock around between them and never let it out for the cores in another
package to grab.


> I wonder if Lee Schermerhorn's patch to turn that
> spinlock into an rwlock would help this workload,
> or if we simply should scan fewer pages in the
> pageout code.

Maybe.  I'm thinking that the problem here is really due to the huge amount
of processing which needs to occur when we are in the "all pages active,
referenced" state and then we hit pages_low.  Panic time, we need to scan
and deactivate a huge amount of stuff.

Would it not be better to prevent that situation from occurring by doing a
bit of scanning and balancing when adding pages to the LRU?  Make sure that
the lists will be in reasonable shape for when reclaim starts?

That'd deoptimise those workloads which allocate and free pages but never
enter reclaim.  Probably liveable with.

We would want to avoid needlessly unmapping pages and causing more minor
faults.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 22:57             ` Andrew Morton
@ 2007-06-28 23:04               ` Rik van Riel
  2007-06-28 23:13                 ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-28 23:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 18:44:56 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
>> Andrew Morton wrote:
>>
>>> Where's the system time being spent?
>> OK, it turns out that there is quite a bit of variability
>> in where the system spends its time.  I did a number of
>> reaim runs and averaged the time the system spent in the
>> top functions.
>>
>> This is with the Fedora rawhide kernel config, which has
>> quite a few debugging options enabled.
>>
>> _raw_spin_lock		32.0%
>> page_check_address	12.7%
>> __delay			10.8%
>> mwait_idle		10.4%
>> anon_vma_unlink		5.7%
>> __anon_vma_link		5.3%
>> lockdep_reset_lock	3.5%
>> __kmalloc_node_track_caller 2.8%
>> security_port_sid	1.8%
>> kfree			1.6%
>> anon_vma_link		1.2%
>> page_referenced_one	1.1%
>>
>> In short, the system is waiting on the anon_vma lock.
> 
> Sigh.  We had a workload (forget which, still unfixed) in which things
> would basically melt down in that linear anon_vma walk, walking 10,000 or
> more vma's.  I wonder if that's what's happening here?

That would be a large multi-threaded application that fills up
memory.  Customers are reproducing this with JVMs on some very
large systems.

> Also, one thing to watch out for here is a problem with the spinlocks
> themselves: the problem wherein the cores in one package keep rattling the
> lock around between them and never let it out for the cores in another
> package to grab.

This is a single package quad core system, though.

>> I wonder if Lee Schermerhorn's patch to turn that
>> spinlock into an rwlock would help this workload,
>> or if we simply should scan fewer pages in the
>> pageout code.
> 
> Maybe.  I'm thinking that the problem here is really due to the huge amount
> of processing which needs to occur when we are in the "all pages active,
> referenced" state and then we hit pages_low.  Panic time, we need to scan
> and deactivate a huge amount of stuff.
> 
> Would it not be better to prevent that situation from occurring by doing a
> bit of scanning and balancing when adding pages to the LRU?  Make sure that
> the lists will be in reasonable shape for when reclaim starts?

Agreed, we need to simply scan fewer pages.

Doing something like SEQ replacement on the anonymous (and other
swap backed) pages might just do the trick here.  Page cache, of
course, should continue using a used-once scheme.

I suspect we want to split out the lists for many other reasons
anyway, as detailed on http://linux-mm.org/PageoutFailureModes

I'll whip up a patch that does this...

> That'd deoptimise those workloads which allocate and free pages but never
> enter reclaim.  Probably liveable with.

If we do true SEQ replacement for anonymous pages (deactivating
active pages without regard to the referenced bit) and keep the
inactive list reasonably small, that penalty should be negligible.

> We would want to avoid needlessly unmapping pages and causing more minor
> faults.

That's a minor issue; the page fault path is pretty cheap and
very scalable.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 23:04               ` Rik van Riel
@ 2007-06-28 23:13                 ` Andrew Morton
  2007-06-28 23:16                   ` Rik van Riel
  2007-06-28 23:25                   ` Andrea Arcangeli
  0 siblings, 2 replies; 77+ messages in thread
From: Andrew Morton @ 2007-06-28 23:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 19:04:05 -0400
Rik van Riel <riel@redhat.com> wrote:

> > Sigh.  We had a workload (forget which, still unfixed) in which things
> > would basically melt down in that linear anon_vma walk, walking 10,000 or
> > more vma's.  I wonder if that's what's happening here?
> 
> That would be a large multi-threaded application that fills up
> memory.  Customers are reproducing this with JVMs on some very
> large systems.

So.... does that mean "yes, it's scanning a lot of vmas"?

If so, I expect there will still be failure modes, whatever we do outside
of this.  A locked, linear walk of a list whose length is
application-controlled is going to be a problem.  Could be that we'll need
an O(n) -> O(log(n)) conversion, which will be tricky in there.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 23:13                 ` Andrew Morton
@ 2007-06-28 23:16                   ` Rik van Riel
  2007-06-28 23:29                     ` Andrew Morton
  2007-06-28 23:25                   ` Andrea Arcangeli
  1 sibling, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-28 23:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 19:04:05 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
>>> Sigh.  We had a workload (forget which, still unfixed) in which things
>>> would basically melt down in that linear anon_vma walk, walking 10,000 or
>>> more vma's.  I wonder if that's what's happening here?
>> That would be a large multi-threaded application that fills up
>> memory.  Customers are reproducing this with JVMs on some very
>> large systems.
> 
> So.... does that mean "yes, it's scanning a lot of vmas"?

Not necessarily.

The problem can also be reproduced if you have many
threads, from "enough" CPUs, all scanning pages in
the same huge VMA.

> If so, I expect there will still be failure modes, whatever we do outside
> of this.  A locked, linear walk of a list whose length is
> application-controlled is going to be a problem.  Could be that we'll need
> an O(n) -> O(log(n)) conversion, which will be tricky in there.

Scanning fewer pages in the pageout path is probably
the way to go.

No matter how efficient we make the scanning of one
individual page, we simply cannot scan through 1TB
worth of anonymous pages (which are all referenced
because they've been there for a week) in order to
deactivate something.

Systems that big are only a year or two away from
general purpose use.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 23:13                 ` Andrew Morton
  2007-06-28 23:16                   ` Rik van Riel
@ 2007-06-28 23:25                   ` Andrea Arcangeli
  2007-06-29  0:12                     ` Andrew Morton
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-28 23:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, linux-mm

On Thu, Jun 28, 2007 at 04:13:50PM -0700, Andrew Morton wrote:
> On Thu, 28 Jun 2007 19:04:05 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
> > > Sigh.  We had a workload (forget which, still unfixed) in which things
> > > would basically melt down in that linear anon_vma walk, walking 10,000 or
> > > more vma's.  I wonder if that's what's happening here?
> > 
> > That would be a large multi-threaded application that fills up
> > memory.  Customers are reproducing this with JVMs on some very
> > large systems.
> 
> So.... does that mean "yes, it's scanning a lot of vmas"?
> 
> If so, I expect there will still be failure modes, whatever we do outside
> of this.  A locked, linear walk of a list whose length is
> application-controlled is going to be a problem.  Could be that we'll need
> an O(n) -> O(log(n)) conversion, which will be tricky in there.

There's no swapping, so are we sure we need to scan the pte? This
might as well be the unmapping code being invoked too early while
there's still clean cache to free. If I/O would start because swapping
is really needed, the O(N) walk wouldn't hog the cpu so much because
lots of time would be spent waiting for I/O too. Decreasing
DEF_PRIORITY should defer the invocation of the unmapping code too.

Conversion to O(log(N)) like for the filebacked mappings shouldn't be
a big problem but it'll waste more static memory for each vma and
anon_vma.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 23:16                   ` Rik van Riel
@ 2007-06-28 23:29                     ` Andrew Morton
  2007-06-29  0:00                       ` Rik van Riel
  0 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2007-06-28 23:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 19:16:45 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Thu, 28 Jun 2007 19:04:05 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> > 
> >>> Sigh.  We had a workload (forget which, still unfixed) in which things
> >>> would basically melt down in that linear anon_vma walk, walking 10,000 or
> >>> more vma's.  I wonder if that's what's happening here?
> >> That would be a large multi-threaded application that fills up
> >> memory.  Customers are reproducing this with JVMs on some very
> >> large systems.
> > 
> > So.... does that mean "yes, it's scanning a lot of vmas"?
> 
> Not necessarily.
> 
> The problem can also be reproduced if you have many
> threads, from "enough" CPUs, all scanning pages in
> the same huge VMA.

I wouldn't have expected the anon_vma lock to be the main problem for a
single vma.

If it _is_ the problem then significant improvements could probably be
obtained by passing the whole isolate_lru_pages() pile of pages into the
rmap code rather than doing them one-at-a-time.

> > If so, I expect there will still be failure modes, whatever we do outside
> > of this.  A locked, linear walk of a list whose length is
> > application-controlled is going to be a problem.  Could be that we'll need
> > an O(n) -> O(log(n)) conversion, which will be tricky in there.
> 
> Scanning fewer pages in the pageout path is probably
> the way to go.

I don't see why that would help.  The bottom-line steady-state case is that
we need to reclaim N pages per second, and we need to scan N*M vmas per
second to do so.  How we chunk that up won't affect the aggregate amount of
work which needs to be done.

Or maybe you're referring to the ongoing LRU balancing thing.  Or to something
else.

> No matter how efficient we make the scanning of one
> individual page, we simply cannot scan through 1TB
> worth of anonymous pages (which are all referenced
> because they've been there for a week) in order to
> deactivate something.

Sure.  And we could avoid that sudden transition by balancing the LRU prior
to hitting the great pages_high wall.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 23:29                     ` Andrew Morton
@ 2007-06-29  0:00                       ` Rik van Riel
  2007-06-29  0:19                         ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-29  0:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:

>> Scanning fewer pages in the pageout path is probably
>> the way to go.
> 
> I don't see why that would help.  The bottom-line steady-state case is that
> we need to reclaim N pages per second, and we need to scan N*M vmas per
> second to do so.  How we chunk that up won't affect the aggregate amount of
> work which needs to be done.
> 
> Or maybe you're referring to the ongoing LRU balancing thing.  Or to something
> else.

Yes, I am indeed talking about LRU balancing.

We pretty much *know* that an anonymous page on the
active list is accessed, so why bother scanning them
all?

We could just deactivate the oldest ones and clear
their referenced bits.

Once they reach the end of the inactive list, we
check for the referenced bit again.  If the page
was accessed, we move it back to the active list.
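
In pseudo-kernel terms (hypothetical helper names, just to illustrate,
not an actual patch):

static void seq_deactivate_anon(struct zone *zone, unsigned long nr)
{
	/*
	 * Move the oldest active anon pages to the inactive list
	 * unconditionally, clearing the referenced bit on the way;
	 * the bit is only looked at again at the inactive tail.
	 */
	while (nr--) {
		struct page *page = isolate_tail_of_active_anon(zone);	/* hypothetical */

		if (!page)
			break;
		clear_page_referenced(page);		/* hypothetical */
		add_to_inactive_anon_head(zone, page);	/* hypothetical */
	}
}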

The only problem with this is that anonymous
pages could be easily pushed out of memory by
the page cache, because the page cache has
totally different locality of reference.

The page cache also benefits from the use-once
scheme we have in place today.

Because of these three reasons, I want to split
the page cache LRU lists from the anonymous
memory LRU lists.

Does this make sense to you?

>> No matter how efficient we make the scanning of one
>> individual page, we simply cannot scan through 1TB
>> worth of anonymous pages (which are all referenced
>> because they've been there for a week) in order to
>> deactivate something.
> 
> Sure.  And we could avoid that sudden transition by balancing the LRU prior
> to hitting the great pages_high wall.

Yes, we will need to do some preactive balancing.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 23:25                   ` Andrea Arcangeli
@ 2007-06-29  0:12                     ` Andrew Morton
  0 siblings, 0 replies; 77+ messages in thread
From: Andrew Morton @ 2007-06-29  0:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, linux-mm

On Fri, 29 Jun 2007 01:25:36 +0200
Andrea Arcangeli <andrea@suse.de> wrote:

> On Thu, Jun 28, 2007 at 04:13:50PM -0700, Andrew Morton wrote:
> > On Thu, 28 Jun 2007 19:04:05 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> > 
> > > > Sigh.  We had a workload (forget which, still unfixed) in which things
> > > > would basically melt down in that linear anon_vma walk, walking 10,000 or
> > > > more vma's.  I wonder if that's what's happening here?
> > > 
> > > That would be a large multi-threaded application that fills up
> > > memory.  Customers are reproducing this with JVMs on some very
> > > large systems.
> > 
> > So.... does that mean "yes, it's scanning a lot of vmas"?
> > 
> > If so, I expect there will still be failure modes, whatever we do outside
> > of this.  A locked, linear walk of a list whose length is
> > application-controlled is going to be a problem.  Could be that we'll need
> > an O(n) -> O(log(n)) conversion, which will be tricky in there.
> 
> There's no swapping, so are we sure we need to scan the pte?

well, for better or for worse, that's the design.  We need to run
page_referenced() when considering whether to deactivate the page and that
involves a scan of all the ptes.

> This
> might as well be the unmapping code being invoked too early while
> there's still clean cache to free.

Might be so, but even if we made changes there, failure modes will remain.

> If I/O would start because swapping
> is really needed, the O(N) walk wouldn't hog the cpu so much because
> lots of time would be spent waiting for I/O too.

yup.  The *total* amount of CPU we spend in there shouldn't matter a lot:
unless something else is bust, it'll be relatively low.  I think the
problem here is that a) we do it all in a big burst and b) we do it on lots
of CPUs at the same time, so that burst is quite an inefficient one.

We _could_ teach kswapd to keep the lists in balance in some fashion even
when we're above pages_high.  But I suspect that'll have corner-cases and
probably it'd be better to do it synchronously.  There's not much point in
having multiple CPUs doing this so some per-zone trylock could perhaps be
used.
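
Something along these lines, say (illustrative only, the lock member
and helper are hypothetical):

static void maybe_balance_zone_lists(struct zone *zone)
{
	/*
	 * Only one CPU at a time does the pre-emptive balancing for a
	 * zone; everybody else skips it and carries on allocating.
	 */
	if (!spin_trylock(&zone->balance_lock))		/* hypothetical member */
		return;
	balance_active_inactive_lists(zone);		/* hypothetical helper */
	spin_unlock(&zone->balance_lock);
}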

> Decreasing
> DEF_PRIORITY should defer the invocation of the unmapping code too.
> 
> Conversion to O(log(N)) like for the filebacked mappings shouldn't be
> a big problem but it'll waste more static memory for each vma and
> anon_vma.

hm, OK, I haven't looked at what would be involved there.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29  0:00                       ` Rik van Riel
@ 2007-06-29  0:19                         ` Andrew Morton
  2007-06-29  0:45                           ` Rik van Riel
  0 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2007-06-29  0:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 20:00:03 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> >> Scanning fewer pages in the pageout path is probably
> >> the way to go.
> > 
> > I don't see why that would help.  The bottom-line steady-state case is that
> > we need to reclaim N pages per second, and we need to scan N*M vmas per
> > second to do so.  How we chunk that up won't affect the aggregate amount of
> > work which needs to be done.
> > 
> > Or maybe you're referring to the ongoing LRU balancing thing.  Or to something
> > else.
> 
> Yes, I am indeed talking about LRU balancing.
> 
> We pretty much *know* that an anonymous page on the
> active list is accessed, so why bother scanning them
> all?

Because there might well be pages in there which haven't been accessed in
days.  Confused.

> We could just deactivate the oldest ones and clear
> their referenced bits.
> 
> Once they reach the end of the inactive list, we
> check for the referenced bit again.  If the page
> was accessed, we move it back to the active list.

ok.

> The only problem with this is that anonymous
> pages could be easily pushed out of memory by
> the page cache, because the page cache has
> totally different locality of reference.

I don't immediately see why we need to change the fundamental aging design
at all.   The problems afacit are

a) that huge burst of activity when we hit pages_high and

b) the fact that this huge burst happens on lots of CPUs at the same time.

And balancing the LRUs _prior_ to hitting pages_high can address both
problems?

It will I guess impact the page aging a bit though.

> The page cache also benefits from the use-once
> scheme we have in place today.
> 
> Because of these three reasons, I want to split
> the page cache LRU lists from the anonymous
> memory LRU lists.
> 
> Does this make sense to you?

Could do, don't know.    What new problems will it introduce? :(

> >> No matter how efficient we make the scanning of one
> >> individual page, we simply cannot scan through 1TB
> >> worth of anonymous pages (which are all referenced
> >> because they've been there for a week) in order to
> >> deactivate something.
> > 
> > Sure.  And we could avoid that sudden transition by balancing the LRU prior
> > to hitting the great pages_high wall.
> 
> Yes, we will need to do some preactive balancing.

OK..

And that huge anon-vma walk might need attention.  At the least we could do
something to prevent lots of CPUs from piling up in there.



* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29  0:19                         ` Andrew Morton
@ 2007-06-29  0:45                           ` Rik van Riel
  2007-06-29  1:12                             ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-29  0:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 20:00:03 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
>> Andrew Morton wrote:
>>
>>>> Scanning fewer pages in the pageout path is probably
>>>> the way to go.
>>> I don't see why that would help.  The bottom-line steady-state case is that
>>> we need to reclaim N pages per second, and we need to scan N*M vmas per
>>> second to do so.  How we chunk that up won't affect the aggregate amount of
>>> work which needs to be done.
>>>
>>> Or maybe you're referring to the ongoing LRU balancing thing.  Or to something
>>> else.
>> Yes, I am indeed talking about LRU balancing.
>>
>> We pretty much *know* that an anonymous page on the
>> active list is accessed, so why bother scanning them
>> all?
> 
> Because there might well be pages in there which haven't been accessed in
> days.  Confused.

We won't know that unless we actually do some background
scanning.  Currently, hours-old (or days-old) referenced
bits are not cleared from anonymous pages.

>> We could just deactivate the oldest ones and clear
>> their referenced bits.
>>
>> Once they reach the end of the inactive list, we
>> check for the referenced bit again.  If the page
>> was accessed, we move it back to the active list.
> 
> ok.
> 
>> The only problem with this is that anonymous
>> pages could be easily pushed out of memory by
>> the page cache, because the page cache has
>> totally different locality of reference.
> 
> I don't immediately see why we need to change the fundamental aging design
> at all.   The problems afacit are
> 
> a) that huge burst of activity when we hit pages_high and
> 
> b) the fact that this huge burst happens on lots of CPUs at the same time.
> 
> And balancing the LRUs _prior_ to hitting pages_high can address both
> problems?

That may work on systems with up to a few GB of memory,
but customers are already rolling out systems with 256GB
of RAM for general purpose use, that's 64 million pages!

Even doing a background scan on that many pages will take
insane amounts of CPU time.

In a few years, they will be deploying systems with 1TB
of memory and throwing random workloads at them.

> It will I guess impact the page aging a bit though.

Yes, it will.  However, I believe that the current system
of page aging is simply not sustainable when memory size
gets insanely large.

>> The page cache also benefits from the use-once
>> scheme we have in place today.
>>
>> Because of these three reasons, I want to split
>> the page cache LRU lists from the anonymous
>> memory LRU lists.
>>
>> Does this make sense to you?
> 
> Could do, don't know.    What new problems will it introduce? :(

The obvious problem is how to balance the eviction of
page cache backed pages versus the eviction of swap
backed pages.

The "good news" here is that the current VM does not
really balance this either, but relies on system
administrators to tweak /proc/sys/vm/swappiness on
systems that run a "corner case" workload.

>>>> No matter how efficient we make the scanning of one
>>>> individual page, we simply cannot scan through 1TB
>>>> worth of anonymous pages (which are all referenced
>>>> because they've been there for a week) in order to
>>>> deactivate something.
>>> Sure.  And we could avoid that sudden transition by balancing the LRU prior
>>> to hitting the great pages_high wall.
>> Yes, we will need to do some preactive balancing.
> 
> OK..
> 
> And that huge anon-vma walk might need attention.  At the least we could do
> something to prevent lots of CPUs from piling up in there.

Speaking of which, I have also seen a thousand processes waiting
to grab the iprune_mutex in prune_icache.

Maybe direct reclaim processes should not dive into this cache
at all, but simply increase some variable indicating that kswapd
might want to prune some extra pages from this cache on its next
run?
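
Something as simple as this would do (hypothetical names, just to
illustrate the split between noting the pressure and acting on it):

static atomic_long_t extra_vfs_pressure = ATOMIC_LONG_INIT(0);

/* Direct reclaim only records the pressure... */
void note_vfs_pressure(long nr_pages)
{
	atomic_long_add(nr_pages, &extra_vfs_pressure);
}

/* ...and kswapd picks it up on its next run and does the pruning. */
long take_vfs_pressure(void)
{
	long nr = atomic_long_read(&extra_vfs_pressure);

	atomic_long_sub(nr, &extra_vfs_pressure);
	return nr;
}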

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29  0:45                           ` Rik van Riel
@ 2007-06-29  1:12                             ` Andrew Morton
  2007-06-29  1:20                               ` Rik van Riel
  0 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2007-06-29  1:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 20:45:20 -0400
Rik van Riel <riel@redhat.com> wrote:

> >> The only problem with this is that anonymous
> >> pages could be easily pushed out of memory by
> >> the page cache, because the page cache has
> >> totally different locality of reference.
> > 
> > I don't immediately see why we need to change the fundamental aging design
> > at all.   The problems afacit are
> > 
> > a) that huge burst of activity when we hit pages_high and
> > 
> > b) the fact that this huge burst happens on lots of CPUs at the same time.
> > 
> > And balancing the LRUs _prior_ to hitting pages_high can address both
> > problems?
> 
> That may work on systems with up to a few GB of memory,
> but customers are already rolling out systems with 256GB
> of RAM for general purpose use, that's 64 million pages!
> 
> Even doing a background scan on that many pages will take
> insane amounts of CPU time.
> 
> In a few years, they will be deploying systems with 1TB
> of memory and throwing random workloads at them.

I don't see how the amount of memory changes anything here: if there are
more pages, more work needs to be done regardless of when we do it.

Still confused.

> >>>> No matter how efficient we make the scanning of one
> >>>> individual page, we simply cannot scan through 1TB
> >>>> worth of anonymous pages (which are all referenced
> >>>> because they've been there for a week) in order to
> >>>> deactivate something.
> >>> Sure.  And we could avoid that sudden transition by balancing the LRU prior
> >>> to hitting the great pages_high wall.
> >> Yes, we will need to do some preactive balancing.
> > 
> > OK..
> > 
> > And that huge anon-vma walk might need attention.  At the least we could do
> > something to prevent lots of CPUs from piling up in there.
> 
> Speaking of which, I have also seen a thousand processes waiting
> to grab the iprune_mutex in prune_icache.
> 

It would make sense to only permit one cpu at a time to go in and do
reclamation against a particular zone (or even node).

But the problem with the vfs caches is that they aren't node/zone-specific.
We wouldn't want to get into the situation where 1023 CPUs are twiddling
thumbs waiting for one CPU to free stuff up (or less extreme variants of
this).

> Maybe direct reclaim processes should not dive into this cache
> at all, but simply increase some variable indicating that kswapd
> might want to prune some extra pages from this cache on its next
> run?

Tell the node's kswapd to go off and do VFS reclaim while the CPUs on that
node wait for it?  That would help I guess, but those thousand processes
would still need to block _somewhere_ waiting for the memory to come back.

Of course, iprune_mutex is a particularly dumb place in which to do that,
because the memory may get freed up from somewhere else.

The general design here could/should be to back off to the top-level when
there's contention (that's presently congestion_wait()) and to poll for
memory-became-allocatable.

So what we could do here is to back off when iprune_mutex is busy and, if
nothing else works out, block in congestion_wait() (which is becoming
increasingly misnamed).  Then, add some more smarts to congestion_wait():
deliver a wakeup when "enough" memory got freed from the VFS caches.

One suspects that at some stage, congestion_wait() will need to be told
what the calling task is actually waiting for (perhaps a zonelist) so that
the wakeup delivery can become smarter.
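
A sketch of what I mean, with made-up names (not a proposal for the
exact interface):

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/swap.h>

static DECLARE_WAIT_QUEUE_HEAD(memory_became_allocatable);

/*
 * Reclaimers that backed off from a contended path sleep here, with a
 * timeout so we never depend on the wakeup actually arriving.
 */
static long wait_for_allocatable_memory(long timeout)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&memory_became_allocatable, &wait, TASK_UNINTERRUPTIBLE);
	timeout = schedule_timeout(timeout);
	finish_wait(&memory_became_allocatable, &wait);
	return timeout;
}

/* Whoever frees a decent batch of memory kicks the waiters early. */
static void note_pages_freed(unsigned long nr_freed)
{
	if (nr_freed >= SWAP_CLUSTER_MAX &&
	    waitqueue_active(&memory_became_allocatable))
		wake_up(&memory_became_allocatable);
}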


But for now, the question is: is this a reasonable overall design?  Back
off from contention points, block at the top-level, polling for allocatable
memory to turn up?


* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29  1:12                             ` Andrew Morton
@ 2007-06-29  1:20                               ` Rik van Riel
  2007-06-29  1:29                                 ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-06-29  1:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Andrew Morton wrote:
> On Thu, 28 Jun 2007 20:45:20 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
>>>> The only problem with this is that anonymous
>>>> pages could be easily pushed out of memory by
>>>> the page cache, because the page cache has
>>>> totally different locality of reference.
>>> I don't immediately see why we need to change the fundamental aging design
>>> at all.   The problems afacit are
>>>
>>> a) that huge burst of activity when we hit pages_high and
>>>
>>> b) the fact that this huge burst happens on lots of CPUs at the same time.
>>>
>>> And balancing the LRUs _prior_ to hitting pages_high can address both
>>> problems?
>> That may work on systems with up to a few GB of memory,
>> but customers are already rolling out systems with 256GB
>> of RAM for general purpose use, that's 64 million pages!
>>
>> Even doing a background scan on that many pages will take
>> insane amounts of CPU time.
>>
>> In a few years, they will be deploying systems with 1TB
>> of memory and throwing random workloads at them.
> 
> I don't see how the amount of memory changes anything here: if there are
> more pages, more work needs to be done regardless of when we do it.
> 
> Still confused.

If we deactivate some of the active pages regardless of
whether or not they were recently referenced, you end
up with "hey, I need to deactivate 1GB worth of pages",
instead of with "I need to scan through 1TB worth of
pages to find 1GB of not recently accessed ones".

Note that this is the exact same argument used against the
used-once cleanups that have been proposed in the past:
it is more work to scan through the whole list than to
have pages end up in a "reclaimable" state by default.

> But the problem with the vfs caches is that they aren't node/zone-specific.
> We wouldn't want to get into the situation where 1023 CPUs are twiddling
> thumbs waiting for one CPU to free stuff up (or less extreme variants of
> this).

The direct reclaimers can free something else.  Chances are they
don't care about the little bit of memory coming out of these
caches.

We just need to make sure the pressure gets evened out later.

>> Maybe direct reclaim processes should not dive into this cache
>> at all, but simply increase some variable indicating that kswapd
>> might want to prune some extra pages from this cache on its next
>> run?
> 
> Tell the node's kswapd to go off and do VFS reclaim while the CPUs on that
> node wait for it?  That would help I guess, but those thousand processes
> would still need to block _somewhere_ waiting for the memory to come back.

Not for the VFS memory.  They can just recycle some page cache
memory or start IO on anonymous memory going into swap.

> So what we could do here is to back off when iprune_mutex is busy and, if
> nothing else works out, block in congestion_wait() (which is becoming
> increasingly misnamed).  Then, add some more smarts to congestion_wait():
> deliver a wakeup when "enough" memory got freed from the VFS caches.

Yeah, that sounds doable.  Not sure if they should wait in
congestion_wait() though, or if they should just return
to __alloc_pages() since they may already have reclaimed
enough pages from the anonymous list.

> But for now, the question is: is this a reasonable overall design?  Back
> off from contention points, block at the top-level, polling for allocatable
> memory to turn up?

I'm not convinced.  If we have already reclaimed some
pages from the inactive list, why wait in congestion_wait()
AT ALL?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29  1:20                               ` Rik van Riel
@ 2007-06-29  1:29                                 ` Andrew Morton
  0 siblings, 0 replies; 77+ messages in thread
From: Andrew Morton @ 2007-06-29  1:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm

On Thu, 28 Jun 2007 21:20:40 -0400
Rik van Riel <riel@redhat.com> wrote:

> > But for now, the question is: is this a reasonable overall design?  Back
> > off from contention points, block at the top-level, polling for allocatable
> > memory to turn up?
> 
> I'm not convinced.  If we have already reclaimed some
> pages from the inactive list, why wait in congestion_wait()
> AT ALL?

Well by top-level I meant top-level.  The point where we either block or
declare oom.

We do that now in alloc_pages(), correctly I believe.

The congestion_wait()s in vmscan.c might be misplaced (ie: too far down)
because they could lead to us blocking when some memory actually got freed
up (or became freeable?) somewhere else.

To fix that we'd need to take a global look at things from within
direct-reclaim, or back out of direct-reclaim back up to alloc_pages(), but
remember where we were up to for the next pass.  Perhaps by extending
scan_control a bit and moving its instantiation up to __alloc_pages().
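
For illustration only (these resume fields don't exist anywhere, the
names are invented), "extending scan_control" might mean something
like:

struct scan_control {
        unsigned long nr_scanned;       /* existing: pages looked at so far */
        gfp_t gfp_mask;                 /* existing: allocation constraints */
        int may_writepage;
        int may_swap;
        int order;
        /* ... the rest of the existing fields ... */

        /* invented resume state: filled in when direct reclaim backs
         * off, consumed on the next pass */
        struct zone *resume_zone;
        int resume_priority;
};

with __alloc_pages() owning the scan_control across its retries instead
of try_to_free_pages() setting up a fresh one on every call.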



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-28 22:44           ` Rik van Riel
  2007-06-28 22:57             ` Andrew Morton
@ 2007-06-29 13:38             ` Lee Schermerhorn
  2007-06-29 14:12               ` Andrea Arcangeli
  1 sibling, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 13:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Nick Dokos

On Thu, 2007-06-28 at 18:44 -0400, Rik van Riel wrote:
> Andrew Morton wrote:
> 
> > Where's the system time being spent?
> 
> OK, it turns out that there is quite a bit of variability
> in where the system spends its time.  I did a number of
> reaim runs and averaged the time the system spent in the
> top functions.
> 
> This is with the Fedora rawhide kernel config, which has
> quite a few debugging options enabled.
> 
> _raw_spin_lock		32.0%
> page_check_address	12.7%
> __delay			10.8%
> mwait_idle		10.4%
> anon_vma_unlink		5.7%
> __anon_vma_link		5.3%
> lockdep_reset_lock	3.5%
> __kmalloc_node_track_caller 2.8%
> security_port_sid	1.8%
> kfree			1.6%
> anon_vma_link		1.2%
> page_referenced_one	1.1%
> 
> In short, the system is waiting on the anon_vma lock.
> 
> I wonder if Lee Schermerhorn's patch to turn that
> spinlock into an rwlock would help this workload,
> or if we simply should scan fewer pages in the
> pageout code.
> 

Rik:

Here's a fairly recent version of the patch if you want to try it on
your workload.  We've seen mixed results on somewhat larger systems,
with and without your split LRU patch.  I've started writing up those
results.  I'll try to get back to finishing up the writeup after OLS and
vacation.

Regards,
Lee

-----------
Patch against 2.6.22-rc4-mm2

Make the anon_vma list lock a read/write lock.  Heaviest use of this
lock is in the page_referenced()/try_to_unmap() calls from vmscan
[shrink_page_list()].  These functions can use a read lock to allow
some parallelism for different cpus trying to reclaim pages mapped
via the same set of vmas.

This change should not change the footprint of the anon_vma in the
non-debug case.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/rmap.h |    9 ++++++---
 mm/migrate.c         |    4 ++--
 mm/mmap.c            |    4 ++--
 mm/rmap.c            |   20 ++++++++++----------
 4 files changed, 20 insertions(+), 17 deletions(-)

Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h	2007-06-11 14:39:56.000000000 -0400
+++ Linux/include/linux/rmap.h	2007-06-20 09:49:24.000000000 -0400
@@ -24,7 +24,7 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	rwlock_t rwlock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -42,18 +42,21 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+/*
+ * This needs to be a write lock for __vma_link()
+ */
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		write_lock(&anon_vma->rwlock);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		write_unlock(&anon_vma->rwlock);
 }
 
 /*
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c	2007-06-11 14:40:06.000000000 -0400
+++ Linux/mm/rmap.c	2007-06-20 09:50:27.000000000 -0400
@@ -25,7 +25,7 @@
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_lock
- *         anon_vma->lock
+ *         anon_vma->rwlock
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
  *             swap_lock (in swap_duplicate, swap_info_get)
@@ -85,7 +85,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (anon_vma) {
 			allocated = NULL;
 			locked = anon_vma;
-			spin_lock(&locked->lock);
+			write_lock(&locked->rwlock);
 		} else {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
@@ -104,7 +104,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_unlock(&mm->page_table_lock);
 
 		if (locked)
-			spin_unlock(&locked->lock);
+			write_unlock(&locked->rwlock);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
@@ -132,10 +132,10 @@ void anon_vma_link(struct vm_area_struct
 	struct anon_vma *anon_vma = vma->anon_vma;
 
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		write_lock(&anon_vma->rwlock);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 		validate_anon_vma(vma);
-		spin_unlock(&anon_vma->lock);
+		write_unlock(&anon_vma->rwlock);
 	}
 }
 
@@ -147,13 +147,13 @@ void anon_vma_unlink(struct vm_area_stru
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	write_lock(&anon_vma->rwlock);
 	validate_anon_vma(vma);
 	list_del(&vma->anon_vma_node);
 
 	/* We must garbage collect the anon_vma if it's empty */
 	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	write_unlock(&anon_vma->rwlock);
 
 	if (empty)
 		anon_vma_free(anon_vma);
@@ -164,7 +164,7 @@ static void anon_vma_ctor(void *data, st
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+ 	rwlock_init(&anon_vma->rwlock);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -191,7 +191,7 @@ static struct anon_vma *page_lock_anon_v
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	read_lock(&anon_vma->rwlock);
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -200,7 +200,7 @@ out:
 
 static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
+	read_unlock(&anon_vma->rwlock);
 	rcu_read_unlock();
 }
 
Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c	2007-06-20 09:39:03.000000000 -0400
+++ Linux/mm/mmap.c	2007-06-20 09:49:24.000000000 -0400
@@ -571,7 +571,7 @@ again:			remove_next = 1 + (end > next->
 	if (vma->anon_vma)
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		write_lock(&anon_vma->rwlock);
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
@@ -625,7 +625,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		write_unlock(&anon_vma->rwlock);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c	2007-06-20 09:39:04.000000000 -0400
+++ Linux/mm/migrate.c	2007-06-20 09:49:24.000000000 -0400
@@ -228,12 +228,12 @@ static void remove_anon_migration_ptes(s
 	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	read_lock(&anon_vma->rwlock);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	read_unlock(&anon_vma->rwlock);
 }
 
 /*



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29 13:38             ` Lee Schermerhorn
@ 2007-06-29 14:12               ` Andrea Arcangeli
  2007-06-29 14:59                 ` Rik van Riel
                                   ` (4 more replies)
  0 siblings, 5 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-06-29 14:12 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos

On Fri, Jun 29, 2007 at 09:38:29AM -0400, Lee Schermerhorn wrote:
> On Thu, 2007-06-28 at 18:44 -0400, Rik van Riel wrote:
> > Andrew Morton wrote:
> > 
> > > Where's the system time being spent?
> > 
> > OK, it turns out that there is quite a bit of variability
> > in where the system spends its time.  I did a number of
> > reaim runs and averaged the time the system spent in the
> > top functions.
> > 
> > This is with the Fedora rawhide kernel config, which has
> > quite a few debugging options enabled.
> > 
> > _raw_spin_lock		32.0%
> > page_check_address	12.7%
> > __delay			10.8%
> > mwait_idle		10.4%
> > anon_vma_unlink		5.7%
> > __anon_vma_link		5.3%
> > lockdep_reset_lock	3.5%
> > __kmalloc_node_track_caller 2.8%
> > security_port_sid	1.8%
> > kfree			1.6%
> > anon_vma_link		1.2%
> > page_referenced_one	1.1%

BTW, I hope the above numbers were measured before the thrashing stage
when the number of jobs per second is lower than 10. It'd be nice not
to spend all that time in system time but after that point the system
will shortly reach oom. It's more important to be fast and save cpu in
"useful" conditions (like with <4000 tasks).

> Here's a fairly recent version of the patch if you want to try it on
> your workload.  We've seen mixed results on somewhat larger systems,
> with and without your split LRU patch.  I've started writing up those
> results.  I'll try to get back to finishing up the writeup after OLS and
> vacation.

This looks a very good idea indeed.

Overall I doubt the O(log(N)) change would help: being able to give an
efficient answer to "give me only the vmas that map this anon page"
won't be helpful here since the answer will be the same as the current
question "give me any vma that may be mapping this anon page". Only
for the file-backed mappings does it matter.

Also I'm stunned this is being compared to a java workload, java is a
threaded beast (unless you're capable of understanding async-io in
which case it's still threaded but with tons less threads; however
you code it, it won't create any anonymous related overhead). What we deal
with isn't really an issue with anon-vma but just with the fact that the
system is trying to unmap pages that are mapped in 4000-5000 ptes, so
no matter how you code it, there will still be 4000-5000 ptes to check
for each page that we want to know if it's referenced and it will take
system time; this is a hardware issue, not a software one. And the
other suspect thing is to do all that pte-mangling work without doing
any I/O at all.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01 of 16] remove nr_scan_inactive/active
  2007-06-29 14:12               ` Andrea Arcangeli
@ 2007-06-29 14:59                 ` Rik van Riel
  2007-06-29 22:39                 ` "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] Lee Schermerhorn
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-06-29 14:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Lee Schermerhorn, Andrew Morton, linux-mm, Nick Dokos

Andrea Arcangeli wrote:

> BTW, I hope the above numbers were measured before the thrashing stage
> when the number of jobs per second is lower than 10. It'd be nice not
> to spend all that time in system time but after that point the system
> will shortly reach oom. It's more important to be fast and save cpu in
> "useful" conditions (like with <4000 tasks).

If the numbers were measured only in the thrashing stage,
mwait_idle would be the top CPU "user", not the scanning
code.

What I am trying to measure is more a question of system
robustness than performance.  We have seen a few cases
where the system took 2 hours to recover to a useful state
after running out of RAM, with enough free swap.

Linux needs to deal better with memory filling up. It
should start to swap instead of scanning pages for very
long periods of time and not recovering for a while.

>> Here's a fairly recent version of the patch if you want to try it on
>> your workload.  We've seen mixed results on somewhat larger systems,
>> with and without your split LRU patch.  I've started writing up those
>> results.  I'll try to get back to finishing up the writeup after OLS and
>> vacation.
> 
> This looks a very good idea indeed.

I'm definitely going to give Lee's patch a spin.

> Also I'm stunned this is being compared to a java workload, java is a
> threaded beast 

Interestingly enough, both a heavy Java workload and this AIM7
test block on the anon_vma lock contention.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* "Noreclaim Infrastructure"  [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active]
  2007-06-29 14:12               ` Andrea Arcangeli
  2007-06-29 14:59                 ` Rik van Riel
@ 2007-06-29 22:39                 ` Lee Schermerhorn
  2007-06-29 22:42                 ` RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" Lee Schermerhorn
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 22:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos

On Fri, 2007-06-29 at 16:12 +0200, Andrea Arcangeli wrote:
> On Fri, Jun 29, 2007 at 09:38:29AM -0400, Lee Schermerhorn wrote:
<snip>
> 
> > Here's a fairly recent version of the patch if you want to try it on
> > your workload.  We've seen mixed results on somewhat larger systems,
> > with and without your split LRU patch.  I've started writing up those
> > results.  I'll try to get back to finishing up the writeup after OLS and
> > vacation.
> 
> This looks a very good idea indeed.
> 
> Overall I doubt the O(log(N)) change would help: being able to give an
> efficient answer to "give me only the vmas that map this anon page"
> won't be helpful here since the answer will be the same as the current
> question "give me any vma that may be mapping this anon page". Only
> for the file-backed mappings does it matter.
> 
> Also I'm stunned this is being compared to a java workload, java is a
> threaded beast (unless you're capable of understanding async-io in
> which case it's still threaded but with tons less threads; however
> you code it, it won't create any anonymous related overhead). What we deal
> with isn't really an issue with anon-vma but just with the fact that the
> system is trying to unmap pages that are mapped in 4000-5000 ptes, so
> no matter how you code it, there will still be 4000-5000 ptes to check
> for each page that we want to know if it's referenced and it will take
> system time; this is a hardware issue, not a software one. And the
> other suspect thing is to do all that pte-mangling work without doing
> any I/O at all.

Andrea:

Yes, the patch is not a panacea.  At best, it allows different kswapd's
to attempt to unmap different pages associated with the same VMA.  But,
as you say, you still have to unmap X000 ptes.  On one of the smaller
ia64 systems we've been testing, we hit this state in the 15000-20000
range of AIM jobs.  This patch, along with Rik's split LRU patch, allowed
us to make forward progress at saturation, and we were actually
swapping, instead of just spinning around in page_referenced() and
try_to_unmap().  [Actually, I don't think we get past page_referenced()
much w/o this patch--have to check.]

I have experimented with another "noreclaim" infrastructure, based on
some patches by Larry Woodman at Red Hat, to keep non-reclaimable pages
off the active/inactive list.  I envisioned this as a general
infrastructure to handle these cases--pages whose anon_vmas have
excessively long vma lists, swap-backed pages for which no swap space is
available, and mlock()ed pages [a la Nick Piggin's patch].

I will include the patch overview here and send along the 2
infrastructure patches and one "client" patch--the excessively
referenced anon_vma case.  I'm not proposing that these be considered
for inclusion.  Just another take on this issue.

The patches are against 2.6.21-rc6.  I have been distracted by other
issues lately, so they have languished, and even the overview is a bit
out of date relative to on-going activity in this area.  I did integrate
this series with Rik's split LRU patch at one time, and it all "worked"
for some definition thereof. 

One final note before the "noreclaim overview":  I have seen similar
behavior on the i_mmap_lock for file-backed pages running a [too] heavy
Oracle/TPC-C workload--on a larger ia64 system with ~8TB of storage.
System hung/unresponsive, spitting out "Soft lockup" messages.  Stack
traces showed cpus in spinlock contention called from
page_referenced_file.  So, it's not limited to anon pages.

Lee
-----------------


This series of patches introduces support for managing "non-reclaimable"
pages off the LRU active and inactive lists.  In this rather long-winded
overview, I attempt to provide the motivation for this work, describe how
it relates to other recent patches that address different aspects of the
"problem", and give an overview of the mechanism.  I'll try not to repeat
too much of this in the patch descriptions.


We have seen instances of large linux servers [10s/100s of GB of memory =>
millions of pages] apparently hanging for extended periods [10s of minutes
or more] while all processors attempt to reclaim memory.  For various
reasons many of the pages on the LRU lists become difficult or impossible
to reclaim.  The system spends a lot of time trying to reclaim [unmap] the
difficult pages and/or shuffling through the impossible ones.

Some of the conditions that make pages difficult or impossible to reclaim:
  1) page is anon or shmem, but no swap space available
  2) page is mlocked into memory
  3) page is anon with an excessive number of related vmas [on the
     anon_vma list].  More on this below.

The basic noreclaim mechanism, described below, is based
on a patch developed by Larry Woodman of Red Hat for RHEL4 [2.6.9+ based
kernel] to address the first condition above--an x86_64 non-NUMA system 
with 64G-128G memory [16M-32M 4k pages] with very little swap space--
~2GB.  The majority of the memory on the system was consumed by large
database shared memory areas.  A file IO intensive operation, such as
backup, causes remaining free memory to be consumed by the page cache,
initiating reclaim.

vmscan then spends a great deal of time shuffling non-swappable anon
and shmem pages between the active and inactive lists, only to find
that it can't move them to the swap cache.  The pages get reactivated
and round and round it goes.  Because pages cannot be easily reclaimed,
eventually other processors need to allocate pages and enter direct
reclaim, only to compete for the zone lru lock.  The single [normal]
zone on the non-numa platform exacerbates this problem, but it can 
also arise, per zone, on numa platforms.

Larry's patch alleviates this problem by maintaining anon and shmem
pages for which no swap space exists on a per zone noreclaim list.
Once the pages have been parked there, vmscan deals only with page
cache pages, and anon/shmem pages to which swap space has already
been assigned.  Pages move from the noreclaim list back to the LRU
when swap space becomes available.

Upstream developers have been addressing some of these issues in other
ways:

Christoph Lameter posted a patch to keep anon pages off the LRU when SWAP
support is not configured into the kernel.  With Christoph's patch, these
pages are left out "in limbo"--not on any list.  Because of this,
Christoph's patch does not address the more common situation of kernels
with SWAP configured in, but insufficient or no swap added.  I think this
is a more common situation because most distros will ship kernels with
the SWAP support configured in--at least for "enterprise" use.  Maintaining
these pages on a noreclaim list will make it possible to restore these
pages to the [in]active lists when/if swap is added.

Nick Piggin's patch to keep mlock'ed pages [condition 2 above] off the
LRU list also lets the mlocked/non-reclaimable pages float, not on any
list.  While Nick's patch does allow these pages to become reclaimable
when all memory locks are removed, there is another reason to keep pages
on a separate list.

We want to be able to migrate anon pages that have no swap space backing
them, and those that are mlocked.  Indeed, the migration infrastructure
supports this.  However, the LRU lists, via the zone lru locks, arbitrate
between tasks attempting to migrate the same pages simultaneously.  To
migrate a page, we must isolate it from the LRU.  If the page cannot be
isolated, migration gives up and moves on to another page.  Whichever
task is successful in isolating the page proceeds with the migration.
Keeping the nonreclaimable pages on a separate list, protected by the
zone lru lock, would preserve this arbitration function.  isolate_lru_page(),
used by both migration and Nick's mlock patch, can be enhanced to find
pages on the noreclaim list, as well as on the [in]active lists.

What's the probability that tasks will race on migrating the same page?
Fairly high if auto-migration ever makes it into the kernel, but non-zero
in any case.

Rik van Riel's patch to split the active and inactive lists can address
the non-swappable page problem by throttling the scan of the anon LRU
lists, which contain both anon and shmem pages.  However, if the system
supports any swap space at all, one still needs to scan the anon lists
to free up memory consumed by pages already in the swap cache.  On
large memory systems, the anon lists can still be millions of pages 
long and contain a large percentage of non-swappable and mlocked
pages.

This series attempts to unify this work into a general mechanism for
managing non-reclaimable pages.  The basic objective is to make vmscan
as productive as possible on very large memory systems, by eliminating
non-productive page shuffling.

Like Larry's patch, the noreclaim infrastructure maintains "non-reclaimable"
pages on a separate per-zone list.  This noreclaim list is, conceptually,
another LRU list--a sibling of the active and inactive lists.  A page on
the noreclaim list will have the PG_lru and PG_noreclaim flags set.  The
PG_noreclaim flag is analogous to, and mutually exclusive with, the
PG_active flag--it specifies which LRU list the page resides on.  The
noreclaim list supports a pagevec cache, like the active and inactive
lists to reduce contention on the zone lru lock in vmscan and in the
fault path.

Pages on the noreclaim list are "hidden" from page reclaim scanning.  Thus,
reclaim will not spend time attempting to reclaim the pages, only to find
that they can't be unmapped, have no swap space available, are locked into
memory, ...  However, vmscan may find pages on the [in]active lists that
have become non-reclaimable since they were put on the list.  It will
move them to the noreclaim list at that time.

This series of patches includes the basic noreclaim list support and one
patch, as a proof of concept, to address the 3rd condition listed above:
the excessively long anon_vma list of related vmas.  This seemed to be
the easiest of the 3 conditions to address, and I have a test case
handy [AIM7--see below].  Additional patches to handle anon pages for
which no swap exists and to layer Nick Piggin's patch to keep "mlock
pages off the LRU" will be forthcoming, if feedback indicates that
this approach is worth pursuing.


Now, about those anon pages with really long "related vma" lists:
We have only seen this in AIM7 benchmarks on largish servers.  The situation
occurs when a single task fork()s many [10s of] thousands of children, and
the system needs to reclaim memory.  We've seen all processors on a
system spinning on the anon_vma lock attempting to unmap pages mapped
by these thousands of children--for 10s of minutes or until we give up
and reboot.

I discussed this issue at LCA'07 in a kernel miniconf presentation.
Linus questioned whether this was a problem that really needs solving.
After all, AIM7 is only a synthetic benchmark.  Does any real application 
behave this way?   After the presentation, someone came up to me and told
me that Apache also fork()s for each incoming connection and can fork
thousands of children.  However, I have not witnessed this, nor do I
know how long lived these children are.

I have included another patch that makes the anon_vma lock a reader/writer
lock.  This allows different cpus to attempt to reclaim, in parallel,
different pages that point to the same anon_vma.  However, this doesn't
solve the problem of trying to unmap pages that are [potentially] mapped
into thousands of vmas.

The last patch in this series counts the number of related vmas on an 
anon_vma's list and, when it exceeds a tunable threshold, pages that 
reference that anon_vma are declared nonreclaimable.  We detect these
non-reclaimable pages either on fault [COW or new anon page in a vma with
an excessively shared anon_vma] or when vmscan encounters such a page on
the LRU list.  

The patch/series does not [yet] support moving such a page back to the
[in]active lists when its anon_vma sharing drops below the threshold.
This usually occurs when a task exits or explicitly unmaps the area.
Any COWed private pages will be freed at this time, but anon pages that
are still shared will remain nonreclaimable even though the related vma
count is below the no-reclaim limit.  Again, I will address this if the
overall approach is deemed worth pursuing.

Additional considerations:

If the noreclaim list contains mlocked pages, they can be directly deleted
from the noreclaim list without scanning when they become unlocked.  But,
note that we can't use one of the lru link fields to contain the mlocked
vma count in this case.

If the noreclaim list contains anon/shmem pages for which no swap space
exists, it will be necessary to scan the list when swap space becomes
available, either because it has been freed from other pages, or because
additional swap has been added.  The latter case should not occur 
frequently enough to be a problem.  We should be able to defer the
scanning when swap space is freed from other pages until a sufficient
number become available or the system is under severe pressure.

If the list contains pages that are merely difficult to reclaim because
of the excessive anon_vma sharing, and if we want to make them reclaimable
again when the anon_vma related vma count drops to an acceptable value,
one would have to scan the list at some point.  Again, this could be 
deferred until there are a sufficient number of such pages to make it
worthwhile or until the system is under severe memory pressure.

The above considerations suggest that one consider separate lists for
non-reclaimable [no swap, mlocked] and difficult to reclaim.   Or,
maybe not...

Interaction of noreclaim list and LRU lists:  My current patch moves
pages to the noreclaim list as soon as they are detected, either on the
active or inactive list.  I could change this such that non-reclaimable
pages found on the active list go to the inactive list first, and 
take a ride there before being declared non-reclaimable.  However, 
we still have the issue of where to place the pages when they come off
the noreclaim list:  back to the active list?  the inactive list?
head or tail thereof?  My current mechanism, with the PG_active and
PG_noreclaim flags being mutually exclusive, does not track activeness
of pages on the noreclaim list.  To do so would require additional
scanning of the list, I think, sort of defeating the purpose of the
list.  But, maybe acceptable if we scan just to test/modify the active
flags.



^ permalink raw reply	[flat|nested] 77+ messages in thread

* RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure"
  2007-06-29 14:12               ` Andrea Arcangeli
  2007-06-29 14:59                 ` Rik van Riel
  2007-06-29 22:39                 ` "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] Lee Schermerhorn
@ 2007-06-29 22:42                 ` Lee Schermerhorn
  2007-06-29 22:44                 ` RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." Lee Schermerhorn
  2007-06-29 22:49                 ` "Noreclaim - client patch 3/3 - treat pages w/ excessively referenced anon_vma as nonreclaimable" Lee Schermerhorn
  4 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 22:42 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos

Patch against 2.6.21-rc5/6

Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan.  Based on a patch by Larry Woodman of Red Hat.

Applies atop two patches from Nick Piggin's "mlock pages off the LRU"
series:  move-and-rework-isolate_lru_page and
move-and-rename-install_arg_page

Maintain "nonreclaimable" pages on a separate per-zone list, to 
"hide" them from vmscan. 

Although this patch series does not support it, the noreclaim list
could be scanned at a lower rate--for example to attempt to reclaim
the "difficult to reclaim" pages when pages are REALLY needed, such
as when reserves are exhausted and a critical need arises.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable.  Subsequent patches will add the various
!reclaimable tests.  Reclaimable pages are placed on the appropriate
LRU list; non-reclaimable pages on the new noreclaim list.

Notes:

1.  Not sure I need the 'vma' arg to page_reclaimable().  I did in an
    earlier incarnation.  I don't seem to now.

2.  for now, use bit 20 in page flags.   Could restrict to 64-bit
    systems only and use one of bits 21-30 [ia64 uses bit 31; other
    archs ???].  

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mm_inline.h  |   34 +++++++++++++++++++-
 include/linux/mmzone.h     |    6 +++
 include/linux/page-flags.h |   20 ++++++++++++
 include/linux/pagevec.h    |    5 +++
 include/linux/swap.h       |   11 ++++++
 mm/Kconfig                 |    8 ++++
 mm/mempolicy.c             |    2 -
 mm/migrate.c               |    8 ++++
 mm/page_alloc.c            |    6 +++
 mm/swap.c                  |   73 +++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                |   75 ++++++++++++++++++++++++++++++++++++++++++++-
 11 files changed, 237 insertions(+), 11 deletions(-)

Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig	2007-03-26 12:39:02.000000000 -0400
+++ Linux/mm/Kconfig	2007-03-26 13:14:05.000000000 -0400
@@ -163,3 +163,11 @@ config ZONE_DMA_FLAG
 	default "0" if !ZONE_DMA
 	default "1"
 
+config NORECLAIM
+	bool "Track non-reclaimable pages"
+	help
+	  Supports tracking of non-reclaimable pages off the [in]active lists
+	  to avoid excessive reclaim overhead on large memory systems.  Pages
+	  may be non-reclaimable because:  they are locked into memory, they
+	  are anonymous pages for which no swap space exists, or they are anon
+	  pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: Linux/include/linux/page-flags.h
===================================================================
--- Linux.orig/include/linux/page-flags.h	2007-03-26 12:39:01.000000000 -0400
+++ Linux/include/linux/page-flags.h	2007-03-26 13:15:08.000000000 -0400
@@ -91,6 +91,9 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_noreclaim		20	/* Page is "non-reclaimable"  */
+
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 
@@ -249,6 +252,23 @@ static inline void SetPageUptodate(struc
 #define PageSwapCache(page)	0
 #endif
 
+#ifdef CONFIG_NORECLAIM
+#define PageNoreclaim(page)	test_bit(PG_noreclaim, &(page)->flags)
+#define SetPageNoreclaim(page)	set_bit(PG_noreclaim, &(page)->flags)
+#define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
+#define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
+//TODO:   need test versions?
+#define TestSetPageNoreclaim(page) \
+				test_and_set_bit(PG_noreclaim, &(page)->flags)
+#define TestClearPageNoreclaim(page) \
+				test_and_clear_bit(PG_noreclaim, &(page)->flags)
+#else
+#define PageNoreclaim(page)	0
+#define SetPageNoreclaim(page)
+#define ClearPageNoreclaim(page)
+#define __ClearPageNoreclaim(page)
+#endif
+
 #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
Index: Linux/include/linux/mmzone.h
===================================================================
--- Linux.orig/include/linux/mmzone.h	2007-03-26 12:39:01.000000000 -0400
+++ Linux/include/linux/mmzone.h	2007-03-26 13:23:10.000000000 -0400
@@ -51,6 +51,9 @@ enum zone_stat_item {
 	NR_FREE_PAGES,
 	NR_INACTIVE,
 	NR_ACTIVE,
+#ifdef CONFIG_NORECLAIM
+	NR_NORECLAIM,
+#endif
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -217,6 +220,9 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	active_list;
 	struct list_head	inactive_list;
+#ifdef CONFIG_NORECLAIM
+	struct list_head	noreclaim_list;
+#endif
 	unsigned long		nr_scan_active;
 	unsigned long		nr_scan_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-03-26 12:39:02.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-03-26 13:17:49.000000000 -0400
@@ -198,6 +198,7 @@ static void bad_page(struct page *page)
 			1 << PG_private |
 			1 << PG_locked	|
 			1 << PG_active	|
+			1 << PG_noreclaim	|
 			1 << PG_dirty	|
 			1 << PG_reclaim |
 			1 << PG_slab    |
@@ -433,6 +434,7 @@ static inline int free_pages_check(struc
 			1 << PG_private |
 			1 << PG_locked	|
 			1 << PG_active	|
+			1 << PG_noreclaim	|
 			1 << PG_reclaim	|
 			1 << PG_slab	|
 			1 << PG_swapcache |
@@ -582,6 +584,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_active	|
+			1 << PG_noreclaim	|
 			1 << PG_dirty	|
 			1 << PG_reclaim	|
 			1 << PG_slab    |
@@ -2673,6 +2676,9 @@ static void __meminit free_area_init_cor
 		zone_pcp_init(zone);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
+#ifdef CONFIG_NORECLAIM
+		INIT_LIST_HEAD(&zone->noreclaim_list);
+#endif
 		zone->nr_scan_active = 0;
 		zone->nr_scan_inactive = 0;
 		zap_zone_vm_stats(zone);
Index: Linux/include/linux/mm_inline.h
===================================================================
--- Linux.orig/include/linux/mm_inline.h	2007-03-26 12:39:01.000000000 -0400
+++ Linux/include/linux/mm_inline.h	2007-03-26 13:24:10.000000000 -0400
@@ -26,11 +26,43 @@ del_page_from_inactive_list(struct zone 
 	__dec_zone_state(zone, NR_INACTIVE);
 }
 
+#ifdef CONFIG_NORECLAIM
+static inline void __dec_zone_noreclaim(struct zone *zone)
+{
+	__dec_zone_state(zone, NR_NORECLAIM);
+}
+
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page)
+{
+	list_add(&page->lru, &zone->noreclaim_list);
+	__inc_zone_state(zone, NR_NORECLAIM);
+}
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page)
+{
+	list_del(&page->lru);
+	__dec_zone_noreclaim(zone);
+}
+#else
+static inline void __dec_zone_noreclaim(struct zone *zone) { }
+
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
+#endif
+
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
-	if (PageActive(page)) {
+	if (PageNoreclaim(page)) {
+		__ClearPageNoreclaim(page);
+		__dec_zone_noreclaim(zone);
+	} else if (PageActive(page)) {
 		__ClearPageActive(page);
 		__dec_zone_state(zone, NR_ACTIVE);
 	} else {
Index: Linux/include/linux/swap.h
===================================================================
--- Linux.orig/include/linux/swap.h	2007-03-26 12:39:01.000000000 -0400
+++ Linux/include/linux/swap.h	2007-03-26 13:13:18.000000000 -0400
@@ -186,6 +186,11 @@ extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
 extern int rotate_reclaimable_page(struct page *page);
 extern void swap_setup(void);
+#ifdef CONFIG_NORECLAIM
+extern void FASTCALL(lru_cache_add_noreclaim(struct page *page));
+#else
+static inline void lru_cache_add_noreclaim(struct page *page) { }
+#endif
 
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zone **, gfp_t);
@@ -207,6 +212,12 @@ static inline int zone_reclaim(struct zo
 }
 #endif
 
+#ifdef CONFIG_NORECLAIM
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+#define page_reclaimable(P, V) 1
+#endif
+
 extern int kswapd_run(int nid);
 
 #ifdef CONFIG_MMU
Index: Linux/include/linux/pagevec.h
===================================================================
--- Linux.orig/include/linux/pagevec.h	2007-02-04 13:44:54.000000000 -0500
+++ Linux/include/linux/pagevec.h	2007-03-26 13:13:18.000000000 -0400
@@ -25,6 +25,11 @@ void __pagevec_release_nonlru(struct pag
 void __pagevec_free(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
 void __pagevec_lru_add_active(struct pagevec *pvec);
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec);
+#else
+static inline void __pagevec_lru_add_noreclaim(struct pagevec *pvec) { }
+#endif
 void pagevec_strip(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
Index: Linux/mm/swap.c
===================================================================
--- Linux.orig/mm/swap.c	2007-02-04 13:44:54.000000000 -0500
+++ Linux/mm/swap.c	2007-03-26 13:13:18.000000000 -0400
@@ -117,14 +117,14 @@ int rotate_reclaimable_page(struct page 
 		return 1;
 	if (PageDirty(page))
 		return 1;
-	if (PageActive(page))
+	if (PageActive(page) | PageNoreclaim(page))
 		return 1;
 	if (!PageLRU(page))
 		return 1;
 
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (PageLRU(page) && !PageActive(page)) {
+	if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
 		list_move_tail(&page->lru, &zone->inactive_list);
 		__count_vm_event(PGROTATED);
 	}
@@ -142,7 +142,7 @@ void fastcall activate_page(struct page 
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page) && !PageActive(page)) {
+	if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
 		del_page_from_inactive_list(zone, page);
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
@@ -160,7 +160,8 @@ void fastcall activate_page(struct page 
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+	if (!PageActive(page) && !PageNoreclaim(page) &&
+			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
@@ -197,6 +198,29 @@ void fastcall lru_cache_add_active(struc
 	put_cpu_var(lru_add_active_pvecs);
 }
 
+#ifdef CONFIG_NORECLAIM
+static DEFINE_PER_CPU(struct pagevec, lru_add_noreclaim_pvecs) = { 0, };
+
+void fastcall lru_cache_add_noreclaim(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_noreclaim_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_noreclaim(pvec);
+	put_cpu_var(lru_add_noreclaim_pvecs);
+}
+
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu)
+{
+	*pvec = &per_cpu(lru_add_noreclaim_pvecs, cpu);
+	if (pagevec_count(*pvec))
+		__pagevec_lru_add_noreclaim(*pvec);
+}
+#else
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu) { }
+#endif
+
 static void __lru_add_drain(int cpu)
 {
 	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
@@ -207,6 +231,8 @@ static void __lru_add_drain(int cpu)
 	pvec = &per_cpu(lru_add_active_pvecs, cpu);
 	if (pagevec_count(pvec))
 		__pagevec_lru_add_active(pvec);
+
+	__drain_noreclaim_pvec(&pvec, cpu);
 }
 
 void lru_add_drain(void)
@@ -277,14 +303,18 @@ void release_pages(struct page **pages, 
 
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
+			int is_lru_page;
+
 			if (pagezone != zone) {
 				if (zone)
 					spin_unlock_irq(&zone->lru_lock);
 				zone = pagezone;
 				spin_lock_irq(&zone->lru_lock);
 			}
-			VM_BUG_ON(!PageLRU(page));
-			__ClearPageLRU(page);
+			is_lru_page = PageLRU(page);
+			VM_BUG_ON(!(is_lru_page));
+			if (is_lru_page)
+				__ClearPageLRU(page);
 			del_page_from_lru(zone, page);
 		}
 
@@ -392,7 +422,7 @@ void __pagevec_lru_add_active(struct pag
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 	}
@@ -402,6 +432,35 @@ void __pagevec_lru_add_active(struct pag
 	pagevec_reinit(pvec);
 }
 
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
+		SetPageNoreclaim(page);
+		add_page_to_noreclaim_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+#endif
+
 /*
  * Try to drop buffers from the pages in a pagevec
  */
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c	2007-03-26 13:11:51.000000000 -0400
+++ Linux/mm/migrate.c	2007-03-26 13:13:18.000000000 -0400
@@ -52,7 +52,10 @@ int migrate_prep(void)
 
 static inline void move_to_lru(struct page *page)
 {
-	if (PageActive(page)) {
+	if (PageNoreclaim(page)) {
+		ClearPageNoreclaim(page);
+		lru_cache_add_noreclaim(page);
+	} else if (PageActive(page)) {
 		/*
 		 * lru_cache_add_active checks that
 		 * the PG_active bit is off.
@@ -322,6 +325,9 @@ static void migrate_page_copy(struct pag
 		SetPageUptodate(newpage);
 	if (PageActive(page))
 		SetPageActive(newpage);
+	else
+		if (PageNoreclaim(page))
+			SetPageNoreclaim(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c	2007-03-26 13:11:51.000000000 -0400
+++ Linux/mm/vmscan.c	2007-03-26 13:24:56.000000000 -0400
@@ -473,6 +473,11 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		if (!page_reclaimable(page, NULL)) {
+			SetPageNoreclaim(page);
+			goto keep_locked;
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 
@@ -587,6 +592,7 @@ free_it:
 		continue;
 
 activate_locked:
+		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -682,6 +688,8 @@ int isolate_lru_page(struct page *page)
 			ClearPageLRU(page);
 			if (PageActive(page))
 				del_page_from_active_list(zone, page);
+			else if (PageNoreclaim(page))
+				del_page_from_noreclaim_list(zone, page);
 			else
 				del_page_from_inactive_list(zone, page);
 		}
@@ -742,8 +750,11 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
+			if (PageActive(page)) {
 				add_page_to_active_list(zone, page);
+				VM_BUG_ON(PageNoreclaim(page));
+			} else if (PageNoreclaim(page))
+				add_page_to_noreclaim_list(zone, page);
 			else
 				add_page_to_inactive_list(zone, page);
 			if (!pagevec_add(&pvec, page)) {
@@ -806,6 +817,9 @@ static void shrink_active_list(unsigned 
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
+#ifdef CONFIG_NORECLAIM
+	LIST_HEAD(l_noreclaim);	/* Pages to go onto the noreclaim list */
+#endif
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
@@ -869,6 +883,14 @@ force_reclaim_mapped:
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
+		if (!page_reclaimable(page, NULL)) {
+			/*
+			 * divert any non-reclaimable pages onto the
+			 * noreclaim list
+			 */
+			list_add(&page->lru, &l_noreclaim);
+			continue;
+		}
 		if (page_mapped(page)) {
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
@@ -931,6 +953,30 @@ force_reclaim_mapped:
 	}
 	__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 
+#ifdef CONFIG_NORECLAIM
+	pgmoved = 0;
+	while (!list_empty(&l_noreclaim)) {
+		page = lru_to_page(&l_noreclaim);
+		prefetchw_prev_lru_page(page, &l_noreclaim, flags);
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(!PageActive(page));
+		ClearPageActive(page);
+		VM_BUG_ON(PageNoreclaim(page));
+		SetPageNoreclaim(page);
+		list_move(&page->lru, &zone->noreclaim_list);
+		pgmoved++;
+		if (!pagevec_add(&pvec, page)) {
+			__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+			pgmoved = 0;
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+#endif
+
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
@@ -1764,3 +1810,30 @@ int zone_reclaim(struct zone *zone, gfp_
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+#ifdef CONFIG_NORECLAIM
+/*
+ * page_reclaimable(struct page *page, struct vm_area_struct *vma)
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * @page       - page to test
+ * @vma        - vm area in which page is/will be mapped.  May be NULL.
+ *               If !NULL, called from fault path.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ *
+ * TODO:  specify locking assumptions
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+	int reclaimable = 1;
+
+	VM_BUG_ON(PageNoreclaim(page));
+
+	/* TODO:  test page [!]reclaimable conditions */
+
+	return reclaimable;
+}
+#endif
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-03-26 13:11:51.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-03-26 13:13:18.000000000 -0400
@@ -1790,7 +1790,7 @@ static void gather_stats(struct page *pa
 	if (PageSwapCache(page))
 		md->swapcache++;
 
-	if (PageActive(page))
+	if (PageActive(page) || PageNoreclaim(page))
 		md->active++;
 
 	if (PageWriteback(page))



^ permalink raw reply	[flat|nested] 77+ messages in thread

* RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..."
  2007-06-29 14:12               ` Andrea Arcangeli
                                   ` (2 preceding siblings ...)
  2007-06-29 22:42                 ` RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" Lee Schermerhorn
@ 2007-06-29 22:44                 ` Lee Schermerhorn
  2007-06-29 22:49                 ` "Noreclaim - client patch 3/3 - treat pages w/ excessively references anon_vma as nonreclaimable" Lee Schermerhorn
  4 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 22:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos

Patch against 2.6.21-rc5

n/m in the noreclaim series

Report non-reclaimable pages per zone and system wide.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c |    6 ++++++
 fs/proc/proc_misc.c |    6 ++++++
 mm/page_alloc.c     |   16 +++++++++++++++-
 mm/vmstat.c         |    3 +++
 4 files changed, 30 insertions(+), 1 deletion(-)

Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-03-26 13:17:49.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-03-26 13:44:51.000000000 -0400
@@ -1574,10 +1574,18 @@ void show_free_areas(void)
 		}
 	}
 
-	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+//TODO:  check/adjust line lengths
+	printk("Active:%lu inactive:%lu"
+#ifdef CONFIG_NORECLAIM
+		" noreclaim:%lu"
+#endif
+		" dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
 		global_page_state(NR_ACTIVE),
 		global_page_state(NR_INACTIVE),
+#ifdef CONFIG_NORECLAIM
+		global_page_state(NR_NORECLAIM),
+#endif
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1602,6 +1610,9 @@ void show_free_areas(void)
 			" high:%lukB"
 			" active:%lukB"
 			" inactive:%lukB"
+#ifdef CONFIG_NORECLAIM
+			" noreclaim:%lukB"
+#endif
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1613,6 +1624,9 @@ void show_free_areas(void)
 			K(zone->pages_high),
 			K(zone_page_state(zone, NR_ACTIVE)),
 			K(zone_page_state(zone, NR_INACTIVE)),
+#ifdef CONFIG_NORECLAIM
+			K(zone_page_state(zone, NR_NORECLAIM)),
+#endif
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone->all_unreclaimable ? "yes" : "no")
Index: Linux/mm/vmstat.c
===================================================================
--- Linux.orig/mm/vmstat.c	2007-03-26 12:39:02.000000000 -0400
+++ Linux/mm/vmstat.c	2007-03-26 13:35:43.000000000 -0400
@@ -434,6 +434,9 @@ static const char * const vmstat_text[] 
 	"nr_free_pages",
 	"nr_active",
 	"nr_inactive",
+#ifdef CONFIG_NORECLAIM
+	"nr_noreclaim",
+#endif
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
Index: Linux/drivers/base/node.c
===================================================================
--- Linux.orig/drivers/base/node.c	2007-03-26 12:38:59.000000000 -0400
+++ Linux/drivers/base/node.c	2007-03-26 13:37:35.000000000 -0400
@@ -49,6 +49,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+		       "Node %d Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:    %8lu kB\n"
 		       "Node %d HighFree:     %8lu kB\n"
@@ -71,6 +74,9 @@ static ssize_t node_read_meminfo(struct 
 		       nid, K(i.totalram - i.freeram),
 		       nid, node_page_state(nid, NR_ACTIVE),
 		       nid, node_page_state(nid, NR_INACTIVE),
+#ifdef CONFIG_NORECLAIM
+		       nid, node_page_state(nid, NR_NORECLAIM),
+#endif
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: Linux/fs/proc/proc_misc.c
===================================================================
--- Linux.orig/fs/proc/proc_misc.c	2007-03-26 12:39:01.000000000 -0400
+++ Linux/fs/proc/proc_misc.c	2007-03-26 13:39:05.000000000 -0400
@@ -154,6 +154,9 @@ static int meminfo_read_proc(char *page,
 		"SwapCached:   %8lu kB\n"
 		"Active:       %8lu kB\n"
 		"Inactive:     %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+		"Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:    %8lu kB\n"
 		"HighFree:     %8lu kB\n"
@@ -184,6 +187,9 @@ static int meminfo_read_proc(char *page,
 		K(total_swapcache_pages),
 		K(global_page_state(NR_ACTIVE)),
 		K(global_page_state(NR_INACTIVE)),
+#ifdef CONFIG_NORECLAIM
+		K(global_page_state(NR_NORECLAIM)),
+#endif
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),



* "Noreclaim - client patch 3/3 - treat pages w/ excessively referenced anon_vma as nonreclaimable"
  2007-06-29 14:12               ` Andrea Arcangeli
                                   ` (3 preceding siblings ...)
  2007-06-29 22:44                 ` RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." Lee Schermerhorn
@ 2007-06-29 22:49                 ` Lee Schermerhorn
  4 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-06-29 22:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Dokos

Here's the last one for now.  I have a couple more in this series
that handle swap-backed pages with no swap space available, but that's
a different topic, right?

----

Patch m/n against 2.6.21-rc5 - track the anon_vma "related vmas" list length

When a single parent forks a large number [thousands, 10s of thousands]
of children, the anon_vma list of related vmas becomes very long.  In
reclaim, this list must be traversed twice--once in page_referenced_anon()
and once in try_to_unmap_anon()--under a spin lock to reclaim the page.
Multiple cpus can end up spinning behind the same anon_vma spinlock and
traversing the lists.  This patch, part of the "noreclaim" series, treats
anon pages with list lengths longer than a tunable threshold as non-
reclaimable.
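
To make the problem concrete, here is a toy userspace reproducer of the
kind of workload described above (illustrative only, not part of this
patch; the sizes and counts are arbitrary).  Every fork() links another
vma into the anon_vma of the parent's private anonymous pages, so the
list grows with the number of live children:

/* toy reproducer sketch -- illustrative only, not part of the patch */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char *buf = malloc(1 << 20);	/* private anonymous memory */
	int i;

	memset(buf, 1, 1 << 20);	/* fault the pages in before forking */
	for (i = 0; i < 20000; i++) {
		if (fork() == 0) {	/* each child adds one vma to the  */
			pause();	/* parent pages' anon_vma list     */
			_exit(0);
		}
	}
	/* reclaiming any page of buf now means walking ~20000 related vmas */
	pause();
	return 0;
}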

1) add mm Kconfig option NORECLAIM_ANON_VMA, dependent on NORECLAIM.
   32-bit systems may not want/need this feature.

2) add a counter of related vmas to the anon_vma structure.  This won't
   increase the size of the structure on 64-bit systems, as it will fit
   in a padding slot.  

3) In [__]anon_vma_[un]link(), track number of related vmas.  The
   count is only incremented/decremented while the anon_vma lock
   is held, so regular, non-atomic, increment/decrement is used.

4) in page_reclaimable(), check the anon_vma count in the vma's anon_vma,
   if a vma was supplied, or else in the page's anon_vma.  In the fault
   path, new anon pages are placed on the LRU before adding the anon rmap,
   so we need to check the vma's anon_vma.  Fortunately, the vma is
   available at that point.  In vmscan, we can just check the page's
   anon_vma for any anon pages that made it onto the [in]active list
   before the anon_vma list length became "excessive".

5) make the threshold tunable via /proc/sys/vm/anon_vma_reclaim_limit.
   Default value of 64 is totally arbitrary, but should be high enough
   that most applications won't hit it.

6) In the fault paths that install new anonymous pages, check whether
   the page is reclaimable or not [#4 above].  If it is, just add it
   to the active lru list [via the pagevec cache], else add it to the
   noreclaim list; see the sketch just after this list.
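
Distilled from the memory.c and vmscan.c hunks below, the new fault-path
behaviour of points 4 and 6 has roughly the following shape.  This is a
sketch only: the helper below does not exist in the patch (the real code
open-codes this in do_anonymous_page(), install_new_anon_page(), etc.),
and the other noreclaim tests are omitted:

/* sketch of points 4 and 6 -- not a function in the patch */
static void lru_add_new_anon_page(struct page *page,
				  struct vm_area_struct *vma,
				  unsigned long address)
{
	page_add_new_anon_rmap(page, vma, address);	/* rmap first: PageAnon() is now true */

	if (page_reclaimable(page, vma))	/* false when anon_vma->count > anon_vma_reclaim_limit */
		lru_cache_add_active(page);	/* normal placement via the pagevec cache */
	else
		lru_cache_add_noreclaim(page);	/* parked on the noreclaim list */
}

The threshold can be adjusted at run time by writing to
/proc/sys/vm/anon_vma_reclaim_limit; writing 0 disables the length check
for subsequently faulted pages.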

Notes:

1) a separate patch makes the anon_vma lock a reader/writer lock.
This allows some parallelism--different cpus can work on different
pages that reference the same anon_vma--but it does not address the
problem of long lists and potentially many ptes to unmap; see the
sketch after these notes.

2) I moved the call to page_add_new_anon_rmap() to before the test
for page_reclaimable(), and thus before the calls to
lru_cache_add_{active|noreclaim}(), so that page_reclaimable()
can recognize the page as anon; I think this obviates the vma
arg to page_reclaimable().  TBD: I believe this reordering is OK,
but the previous order may have existed to close some obscure
race.
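
For reference, the reclaim-side walk that the introduction and note 1
refer to has roughly the following shape.  This is a simplified sketch
of the 2.6.21-era page_referenced_anon()/try_to_unmap_anon() pattern,
not the exact rmap.c code; check_and_clear_young() is a made-up
stand-in for the per-vma pte work, and with the rwlock patch from
note 1 the spin_lock/spin_unlock pair becomes read_lock/read_unlock,
but the full list is still traversed:

/* simplified sketch of the reclaim-side walk -- not the exact rmap.c code */
static int page_referenced_anon_sketch(struct page *page)
{
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	int referenced = 0;

	anon_vma = (struct anon_vma *)((unsigned long)page->mapping &
						~PAGE_MAPPING_ANON);
	spin_lock(&anon_vma->lock);	/* every reclaiming cpu serializes here */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
		referenced += check_and_clear_young(page, vma);	/* stand-in for per-vma pte work */
	spin_unlock(&anon_vma->lock);

	/* try_to_unmap_anon() takes the same lock and repeats the same walk */
	return referenced;
}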

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/rmap.h   |   57 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/swap.h   |    3 ++
 include/linux/sysctl.h |    1 
 kernel/sysctl.c        |   12 ++++++++++
 mm/Kconfig             |   11 +++++++++
 mm/memory.c            |   20 +++++++++++++----
 mm/rmap.c              |    9 ++++++-
 mm/vmscan.c            |   22 +++++++++++++++++-
 8 files changed, 127 insertions(+), 8 deletions(-)

Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig	2007-03-28 16:33:18.000000000 -0400
+++ Linux/mm/Kconfig	2007-03-28 16:34:00.000000000 -0400
@@ -171,3 +171,14 @@ config NORECLAIM
 	  may be non-reclaimable because:  they are locked into memory, they
 	  are anonymous pages for which no swap space exists, or they are anon
 	  pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_ANON_VMA
+	bool "Exclude pages with excessively long anon_vma lists"
+	depends on NORECLAIM
+	help
+	  Treats anonymous pages with excessively long anon_vma lists as
+	  non-reclaimable.  Long anon_vma lists result from fork()ing
+	  many [hundreds, thousands] of children from a single parent.  The
+	  anonymous pages in such tasks are very expensive [sometimes almost
+	  impossible] to reclaim.  Treating them as non-reclaimable avoids
+	  the overhead of attempting to reclaim them.
Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h	2007-03-28 16:33:18.000000000 -0400
+++ Linux/include/linux/rmap.h	2007-03-28 16:33:29.000000000 -0400
@@ -10,6 +10,18 @@
 #include <linux/spinlock.h>
 
 /*
+ * Optionally, limit the growth of the anon_vma list of "related" vmas
+ * to anon_vma_reclaim_limit.  Add a count member
+ * to the anon_vma structure where we'd have padding on a 64-bit
+ * system w/o lock debugging.
+ */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 64
+#else
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 0
+#endif
+
+/*
  * The anon_vma heads a list of private "related" vmas, to scan if
  * an anonymous page pointing to this anon_vma needs to be unmapped:
  * the vmas on the list will be related by forking, or by splitting.
@@ -25,6 +37,9 @@
  */
 struct anon_vma {
 	rwlock_t rwlock;	/* Serialize access to vma list */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	int count;	/* number of "related" vmas */
+#endif
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -34,11 +49,18 @@ extern struct kmem_cache *anon_vma_cache
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *anon_vma;
+
+	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	if (DEFAULT_ANON_VMA_RECLAIM_LIMIT && anon_vma)
+		anon_vma->count = 0;
+	return anon_vma;
 }
 
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
+	if (DEFAULT_ANON_VMA_RECLAIM_LIMIT)
+		VM_BUG_ON(anon_vma->count);
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -59,6 +81,39 @@ static inline void anon_vma_unlock(struc
 		write_unlock(&anon_vma->rwlock);
 }
 
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+
+/*
+ * Track number of "related" vmas on anon_vma list.
+ * Only called with anon_vma lock held.
+ * Note:  we track related vmas on fork() and splits, but
+ * only enforce the limit on fork().
+ */
+static inline void add_related_vma(struct anon_vma *anon_vma)
+{
+	++anon_vma->count;
+}
+
+static inline void remove_related_vma(struct anon_vma *anon_vma)
+{
+	--anon_vma->count;
+	VM_BUG_ON(anon_vma->count < 0);
+}
+
+static inline struct anon_vma *page_anon_vma(struct page *page)
+{
+	VM_BUG_ON(!PageAnon(page));
+	return (struct anon_vma *)((unsigned long)page->mapping &
+						~PAGE_MAPPING_ANON);
+}
+
+#else
+
+#define add_related_vma(A)
+#define remove_related_vma(A)
+
+#endif
+
 /*
  * anon_vma helper functions.
  */
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c	2007-03-28 16:33:18.000000000 -0400
+++ Linux/mm/rmap.c	2007-03-28 16:33:29.000000000 -0400
@@ -99,6 +99,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+			add_related_vma(anon_vma);
 			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
@@ -113,8 +114,11 @@ int anon_vma_prepare(struct vm_area_stru
 
 void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
 {
-	BUG_ON(vma->anon_vma != next->anon_vma);
+	struct anon_vma *anon_vma = vma->anon_vma;
+
+	BUG_ON(anon_vma != next->anon_vma);
 	list_del(&next->anon_vma_node);
+	remove_related_vma(anon_vma);
 }
 
 void __anon_vma_link(struct vm_area_struct *vma)
@@ -123,6 +127,7 @@ void __anon_vma_link(struct vm_area_stru
 
 	if (anon_vma) {
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+		add_related_vma(anon_vma);
 		validate_anon_vma(vma);
 	}
 }
@@ -134,6 +139,7 @@ void anon_vma_link(struct vm_area_struct
 	if (anon_vma) {
 		write_lock(&anon_vma->rwlock);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+		add_related_vma(anon_vma);
 		validate_anon_vma(vma);
 		write_unlock(&anon_vma->rwlock);
 	}
@@ -150,6 +156,7 @@ void anon_vma_unlink(struct vm_area_stru
 	write_lock(&anon_vma->rwlock);
 	validate_anon_vma(vma);
 	list_del(&vma->anon_vma_node);
+	remove_related_vma(anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
 	empty = list_empty(&anon_vma->head);
Index: Linux/include/linux/swap.h
===================================================================
--- Linux.orig/include/linux/swap.h	2007-03-28 16:33:18.000000000 -0400
+++ Linux/include/linux/swap.h	2007-03-28 16:33:29.000000000 -0400
@@ -214,6 +214,9 @@ static inline int zone_reclaim(struct zo
 
 #ifdef CONFIG_NORECLAIM
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+extern int anon_vma_reclaim_limit;
+#endif
 #else
 #define page_reclaimable(P, V) 1
 #endif
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c	2007-03-28 16:33:18.000000000 -0400
+++ Linux/mm/vmscan.c	2007-03-28 16:34:00.000000000 -0400
@@ -1812,6 +1812,10 @@ int zone_reclaim(struct zone *zone, gfp_
 #endif
 
 #ifdef CONFIG_NORECLAIM
+
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+int anon_vma_reclaim_limit = DEFAULT_ANON_VMA_RECLAIM_LIMIT;
+#endif
 /*
  * page_reclaimable(struct page *page, struct vm_area_struct *vma)
  * Test whether page is reclaimable--i.e., should be placed on active/inactive
@@ -1822,7 +1826,8 @@ int zone_reclaim(struct zone *zone, gfp_
  *               If !NULL, called from fault path.
  *
  * Reasons page might not be reclaimable:
- * TODO - later patches
+ * 1) anon_vma [if any] has too many related vmas
+ * [more TBD.  e.g., anon page and no swap available, page mlocked, ...]
  *
  * TODO:  specify locking assumptions
  */
@@ -1832,7 +1837,20 @@ int page_reclaimable(struct page *page, 
 
 	VM_BUG_ON(PageNoreclaim(page));
 
-	/* TODO:  test page [!]reclaimable conditions */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	if (PageAnon(page)) {
+		struct anon_vma *anon_vma;
+
+		/*
+		 * anon page with too many related vmas?
+		 */
+		anon_vma = page_anon_vma(page);
+		VM_BUG_ON(!anon_vma);
+		if (anon_vma_reclaim_limit &&
+			anon_vma->count > anon_vma_reclaim_limit)
+			reclaimable = 0;
+	}
+#endif
 
 	return reclaimable;
 }
Index: Linux/include/linux/sysctl.h
===================================================================
--- Linux.orig/include/linux/sysctl.h	2007-03-28 16:33:18.000000000 -0400
+++ Linux/include/linux/sysctl.h	2007-03-28 16:33:29.000000000 -0400
@@ -207,6 +207,7 @@ enum
 	VM_PANIC_ON_OOM=33,	/* panic at out-of-memory */
 	VM_VDSO_ENABLED=34,	/* map VDSO into new processes? */
 	VM_MIN_SLAB=35,		 /* Percent pages ignored by zone reclaim */
+	VM_ANON_VMA_RECLAIM_LIMIT=36, /* max "related vmas" for reclaim  */
 
 	/* s390 vm cmm sysctls */
 	VM_CMM_PAGES=1111,
Index: Linux/kernel/sysctl.c
===================================================================
--- Linux.orig/kernel/sysctl.c	2007-03-28 16:33:18.000000000 -0400
+++ Linux/kernel/sysctl.c	2007-03-28 16:33:29.000000000 -0400
@@ -859,6 +859,18 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	{
+		.ctl_name	= VM_ANON_VMA_RECLAIM_LIMIT,
+		.procname	= "anon_vma_reclaim_limit",
+		.data		= &anon_vma_reclaim_limit,
+		.maxlen		= sizeof(anon_vma_reclaim_limit),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
Index: Linux/mm/memory.c
===================================================================
--- Linux.orig/mm/memory.c	2007-03-28 16:33:18.000000000 -0400
+++ Linux/mm/memory.c	2007-03-28 16:33:29.000000000 -0400
@@ -1650,8 +1650,11 @@ gotten:
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
-		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
+		if (page_reclaimable(new_page, vma))
+			lru_cache_add_active(new_page);
+		else
+			lru_cache_add_noreclaim(new_page);
 
 		/* Free the old page.. */
 		new_page = old_page;
@@ -2149,8 +2152,11 @@ int install_new_anon_page(struct vm_area
 	inc_mm_counter(mm, anon_rss);
 	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
-	lru_cache_add_active(page);
 	page_add_new_anon_rmap(page, vma, address);
+	if (page_reclaimable(page, vma))
+		lru_cache_add_active(page);
+	else
+		lru_cache_add_noreclaim(page);
 	pte_unmap_unlock(pte, ptl);
 
 	/* no need for flush_tlb */
@@ -2187,8 +2193,11 @@ static int do_anonymous_page(struct mm_s
 		if (!pte_none(*page_table))
 			goto release;
 		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
+		if (page_reclaimable(page, vma))
+			lru_cache_add_active(page);
+		else
+			lru_cache_add_noreclaim(page);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
@@ -2334,8 +2343,11 @@ retry:
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
-			lru_cache_add_active(new_page);
 			page_add_new_anon_rmap(new_page, vma, address);
+			if (page_reclaimable(new_page, vma))
+				lru_cache_add_active(new_page);
+			else
+				lru_cache_add_noreclaim(new_page);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);



Thread overview: 77+ messages
2007-06-08 20:02 [PATCH 00 of 16] OOM related fixes Andrea Arcangeli
2007-06-08 20:02 ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
2007-06-10 17:36   ` Rik van Riel
2007-06-10 18:17     ` Andrea Arcangeli
2007-06-11 14:58       ` Rik van Riel
2007-06-26 17:08       ` Rik van Riel
2007-06-26 17:55         ` Andrew Morton
2007-06-26 19:02           ` Rik van Riel
2007-06-28 22:44           ` Rik van Riel
2007-06-28 22:57             ` Andrew Morton
2007-06-28 23:04               ` Rik van Riel
2007-06-28 23:13                 ` Andrew Morton
2007-06-28 23:16                   ` Rik van Riel
2007-06-28 23:29                     ` Andrew Morton
2007-06-29  0:00                       ` Rik van Riel
2007-06-29  0:19                         ` Andrew Morton
2007-06-29  0:45                           ` Rik van Riel
2007-06-29  1:12                             ` Andrew Morton
2007-06-29  1:20                               ` Rik van Riel
2007-06-29  1:29                                 ` Andrew Morton
2007-06-28 23:25                   ` Andrea Arcangeli
2007-06-29  0:12                     ` Andrew Morton
2007-06-29 13:38             ` Lee Schermerhorn
2007-06-29 14:12               ` Andrea Arcangeli
2007-06-29 14:59                 ` Rik van Riel
2007-06-29 22:39                 ` "Noreclaim Infrastructure" [was Re: [PATCH 01 of 16] remove nr_scan_inactive/active] Lee Schermerhorn
2007-06-29 22:42                 ` RFC "Noreclaim Infrastructure - patch 1/3 basic infrastructure" Lee Schermerhorn
2007-06-29 22:44                 ` RFC "Noreclaim Infrastructure patch 2/3 - noreclaim statistics..." Lee Schermerhorn
2007-06-29 22:49                 ` "Noreclaim - client patch 3/3 - treat pages w/ excessively referenced anon_vma as nonreclaimable" Lee Schermerhorn
2007-06-26 20:37         ` [PATCH 01 of 16] remove nr_scan_inactive/active Andrea Arcangeli
2007-06-26 20:57           ` Rik van Riel
2007-06-26 22:21             ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 02 of 16] avoid oom deadlock in nfs_create_request Andrea Arcangeli
2007-06-10 17:38   ` Rik van Riel
2007-06-10 18:27     ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 03 of 16] prevent oom deadlocks during read/write operations Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 04 of 16] serialize oom killer Andrea Arcangeli
2007-06-09  6:43   ` Peter Zijlstra
2007-06-09 15:27     ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 05 of 16] avoid selecting already killed tasks Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 06 of 16] reduce the probability of an OOM livelock Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 07 of 16] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 08 of 16] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 09 of 16] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
2007-06-08 21:57   ` Christoph Lameter
2007-06-08 20:03 ` [PATCH 10 of 16] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
2007-06-08 21:48   ` Christoph Lameter
2007-06-09  1:59     ` Andrea Arcangeli
2007-06-09  3:01       ` Christoph Lameter
2007-06-09 14:05         ` Andrea Arcangeli
2007-06-09 14:38           ` Andrea Arcangeli
2007-06-11 16:07             ` Christoph Lameter
2007-06-11 16:50               ` Andrea Arcangeli
2007-06-11 16:57                 ` Christoph Lameter
2007-06-11 17:51                   ` Andrea Arcangeli
2007-06-11 17:56                     ` Christoph Lameter
2007-06-11 18:22                       ` Andrea Arcangeli
2007-06-11 18:39                         ` Christoph Lameter
2007-06-11 18:58                           ` Andrea Arcangeli
2007-06-11 19:25                             ` Christoph Lameter
2007-06-11 16:04           ` Christoph Lameter
2007-06-08 20:03 ` [PATCH 11 of 16] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 12 of 16] show mem information only when a task is actually being killed Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 13 of 16] simplify oom heuristics Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 14 of 16] oom select should only take rss into account Andrea Arcangeli
2007-06-10 17:17   ` Rik van Riel
2007-06-10 17:30     ` Andrea Arcangeli
2007-06-08 20:03 ` [PATCH 15 of 16] limit reclaim if enough pages have been freed Andrea Arcangeli
2007-06-10 17:20   ` Rik van Riel
2007-06-10 17:32     ` Andrea Arcangeli
2007-06-10 17:52       ` Rik van Riel
2007-06-11 16:23         ` Christoph Lameter
2007-06-11 16:57           ` Rik van Riel
2007-06-08 20:03 ` [PATCH 16 of 16] avoid some lock operation in vm fast path Andrea Arcangeli
2007-06-08 21:26 ` [PATCH 00 of 16] OOM related fixes William Lee Irwin III
2007-06-09 14:55   ` Andrea Arcangeli
2007-06-12  8:58     ` Petr Tesarik
